Organize the description for WebsiteAgent into sections

Akinori MUSHA 9 years ago
Parent
Current commit
63583d4286
1 file changed, 14 insertions and 2 deletions
app/models/agents/website_agent.rb (+14, -2)

@@ -19,10 +19,18 @@ module Agents
 
       `url` can be a single url, or an array of urls (for example, for multiple pages with the exact same structure but different content to scrape)
 
+      The WebsiteAgent can also scrape based on incoming events. It will scrape the url contained in the `url` key of the incoming event payload. If you specify `merge` as the `mode`, it will retain the old payload and update it with the new values.
+
+      # Supported Document Types
+
       The `type` value can be `xml`, `html`, `json`, or `text`.
 
       To tell the Agent how to parse the content, specify `extract` as a hash with keys naming the extractions and values of hashes.
 
+      Note that for all of the formats, whatever you extract MUST have the same number of matches for each extractor.  E.g., if you're extracting rows, all extractors must match all rows.  For generating CSS selectors, something like [SelectorGadget](http://selectorgadget.com) may be helpful.
+
+      # Scraping HTML and XML
+
       When parsing HTML or XML, these sub-hashes specify how each extraction should be done.  The Agent first selects a node set from the document for each extraction key by evaluating either a CSS selector in `css` or an XPath expression in `xpath`.  It then evaluates an XPath expression in `value` on each node in the node set, converting the result into string.  Here's an example:
 
           "extract": {
@@ -37,6 +45,8 @@ module Agents
 
       Beware that when parsing an XML document (i.e. `type` is `xml`) using `xpath` expressions all namespaces are stripped from the document unless a toplevel option `use_namespaces` is set to true.
 
+      # Scraping JSON
+
       When parsing JSON, these sub-hashes specify [JSONPaths](http://goessner.net/articles/JsonPath/) to the values that you care about.  For example:
 
           "extract": {
@@ -44,6 +54,8 @@ module Agents
             "description": { "path": "results.data[*].description" }
           }
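
Reviewer note: the JSON extraction described in this hunk, together with the rule that every extractor must produce the same number of matches, can be sketched in plain Ruby. This is only an illustration under stated assumptions, not the agent's implementation: it hand-rolls a resolver for the narrow `a.b[*].c` path shape seen in the example instead of using a real JSONPath library, and `resolve_path`, `extract_rows`, the `title` extractor, and the sample document are all hypothetical.

```ruby
require "json"

# Resolve a path such as "results.data[*].description" against parsed JSON.
# Only dot-separated keys with an optional "[*]" suffix are handled here;
# this is NOT a full JSONPath implementation.
def resolve_path(doc, path)
  path.split(".").reduce([doc]) do |nodes, segment|
    wildcard = segment.end_with?("[*]")
    key = segment.delete_suffix("[*]")
    nodes.flat_map do |node|
      next [] unless node.is_a?(Hash) && node.key?(key)
      value = node[key]
      if wildcard
        value.is_a?(Array) ? value : []
      else
        [value]
      end
    end
  end
end

# Run every extractor, insist on equal match counts, then zip the
# per-extractor columns into one hash per row.
def extract_rows(doc, extract)
  columns = extract.transform_values { |opts| resolve_path(doc, opts["path"]) }
  counts = columns.values.map(&:length).uniq
  if counts.length > 1
    raise ArgumentError, "extractors matched different counts: #{columns.transform_values(&:length)}"
  end
  (0...(counts.first || 0)).map do |i|
    columns.transform_values { |values| values[i] }
  end
end

# Hypothetical sample document, shaped to fit the path in the example above.
SAMPLE_JSON = <<~JSON
  { "results": { "data": [
    { "title": "First",  "description": "One" },
    { "title": "Second", "description": "Two" }
  ] } }
JSON

EXTRACT = {
  "title"       => { "path" => "results.data[*].title" },       # hypothetical extractor
  "description" => { "path" => "results.data[*].description" }
}

ROWS = extract_rows(JSON.parse(SAMPLE_JSON), EXTRACT)
```

If one extractor matched fewer items than another, `extract_rows` raises instead of silently producing misaligned rows, which is the failure the "same number of matches" note warns about.
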
 
+      # Scraping Text
+
       When parsing text, each sub-hash should contain a `regexp` and `index`.  Output text is matched against the regular expression repeatedly from the beginning through to the end, collecting a captured group specified by `index` in each match.  Each index should be either an integer or a string name which corresponds to <code>(?&lt;<em>name</em>&gt;...)</code>.  For example, to parse lines of <code><em>word</em>: <em>definition</em></code>, the following should work:
 
           "extract": {
@@ -66,7 +78,7 @@ module Agents
 
       Beware that `.` does not match the newline character (LF) unless the `m` flag is in effect, and `^`/`$` basically match every line beginning/end.  See [this document](http://ruby-doc.org/core-#{RUBY_VERSION}/doc/regexp_rdoc.html) to learn the regular expression variant used in this service.
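
Reviewer note: the repeated matching and the `.` versus `m` flag behavior described for text scraping can be tried out in plain Ruby. This is a minimal sketch, not the agent's code: `collect_group`, the pattern, and the sample text are hypothetical, and compiling the pattern with `Regexp.new` is an assumption about how a `regexp` string might be handled.

```ruby
# Match `pattern` repeatedly over `text` and collect one captured group per
# match. `index` may be an Integer or the String name of a named group.
def collect_group(text, pattern, index)
  regexp = Regexp.new(pattern)
  matches = []
  text.scan(regexp) { matches << Regexp.last_match[index] }
  matches
end

# Hypothetical input in the "word: definition" shape.
TEXT = "apple: a fruit\nbook: pages bound together\n"
PATTERN = '^(?<word>.+?): (?<definition>.+)$'

WORDS       = collect_group(TEXT, PATTERN, "word")  # by group name
DEFINITIONS = collect_group(TEXT, PATTERN, 2)       # by group number

# `.` stops at a newline unless the `m` flag is set.
DOT_PLAIN = "a\nb".match?(/a.b/)
DOT_MULTI = "a\nb".match?(/a.b/m)
```

Because `^` and `$` anchor at every line in Ruby, the pattern yields one match per line without any extra flag.
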
 
-      Note that for all of the formats, whatever you extract MUST have the same number of matches for each extractor.  E.g., if you're extracting rows, all extractors must match all rows.  For generating CSS selectors, something like [SelectorGadget](http://selectorgadget.com) may be helpful.
+      # General Options
 
       Can be configured to use HTTP basic auth by including the `basic_auth` parameter with `"username:password"`, or `["username", "password"]`.
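
Reviewer note: the two accepted shapes of `basic_auth` can be reduced to one form before building a request header. The sketch below is an assumption about how that might look, not the agent's code: `normalize_basic_auth` and `basic_auth_header` are hypothetical helpers, and splitting the string form on the first colon only is a choice made here so that passwords containing colons survive.

```ruby
# Accept either "username:password" or ["username", "password"] and
# return a two-element [username, password] array.
def normalize_basic_auth(value)
  case value
  when Array
    raise ArgumentError, "expected [username, password]" unless value.length == 2
    value.map(&:to_s)
  when String
    # Split on the first colon only (assumption, see note above).
    username, password = value.split(":", 2)
    [username.to_s, password.to_s]
  else
    raise ArgumentError, "unsupported basic_auth value: #{value.inspect}"
  end
end

# Build the value of an HTTP `Authorization` header for basic auth.
# pack("m0") is Base64 encoding without line breaks.
def basic_auth_header(value)
  username, password = normalize_basic_auth(value)
  "Basic " + ["#{username}:#{password}"].pack("m0")
end

FROM_STRING = basic_auth_header("alice:secret")
FROM_ARRAY  = basic_auth_header(["alice", "secret"])
```

Both forms should produce the same header, so a configuration can use whichever is more convenient.
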
 
@@ -84,7 +96,7 @@ module Agents
 
       Set `unzip` to `gzip` to inflate the resource using gzip.
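
Reviewer note: for readers unsure what "inflate the resource using gzip" amounts to, here is a small sketch using Ruby's standard `zlib` library. It is not taken from the agent; `inflate_gzip` and `gzip_for_demo` are hypothetical helpers, and the compressed body is generated locally rather than fetched.

```ruby
require "zlib"
require "stringio"

# Inflate a gzip-compressed response body into its original bytes.
def inflate_gzip(body)
  Zlib::GzipReader.new(StringIO.new(body)).read
end

# Demo helper only: produce a gzip-compressed string to feed the reader.
def gzip_for_demo(text)
  buffer = StringIO.new
  writer = Zlib::GzipWriter.new(buffer)
  writer.write(text)
  writer.finish # flush the gzip trailer without closing the buffer
  buffer.string
end

COMPRESSED = gzip_for_demo('{"status": "ok"}')
INFLATED   = inflate_gzip(COMPRESSED)
```
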
 
-      The WebsiteAgent can also scrape based on incoming events. It will scrape the url contained in the `url` key of the incoming event payload. If you specify `merge` as the mode, it will retain the old payload and update it with the new values.
+      # Liquid Templating
 
       In Liquid templating, the following variable is available: