Merge pull request #855 from irfancharania/Update_WebsiteAgent_Description

WebsiteAgent: Add instruction for extracting inner and outer HTML

Akinori MUSHA 9 years ago
parent
commit
f4312f6ad8
1 changed file with 3 additions and 1 deletion

app/models/agents/website_agent.rb (+3 -1)

@@ -31,7 +31,9 @@ module Agents
             "body_text": { "css": "div.main", "value": ".//text()" }
           }
 
-      "@_attr_" is the XPath expression to extract the value of an attribute named _attr_ from a node, and ".//text()" is to extract all the enclosed texts.  You can also use [XPath functions](http://www.w3.org/TR/xpath/#section-String-Functions) like `normalize-space` to strip and squeeze whitespace, `substring-after` to extract part of a text, and `translate` to remove comma from a formatted number, etc.  Note that these functions take a string, not a node set, so what you may think would be written as `normalize-space(.//text())` should actually be `normalize-space(.)`.
+      "@_attr_" is the XPath expression to extract the value of an attribute named _attr_ from a node, and ".//text()" is to extract all the enclosed texts. To extract the innerHTML, use "./node()"; and to extract the outer HTML, use  ".".
+
+      You can also use [XPath functions](http://www.w3.org/TR/xpath/#section-String-Functions) like `normalize-space` to strip and squeeze whitespace, `substring-after` to extract part of a text, and `translate` to remove comma from a formatted number, etc.  Note that these functions take a string, not a node set, so what you may think would be written as `normalize-space(.//text())` should actually be `normalize-space(.)`.
 
       Beware that when parsing an XML document (i.e. `type` is `xml`) using `xpath` expressions all namespaces are stripped from the document unless a toplevel option `use_namespaces` is set to true.
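
For readers of this change, the XPath values described in the updated text can be sketched in plain Ruby. This is an illustration only, not code from `website_agent.rb`: the hash below restates the kind of `extract` entry shown in the description, with hypothetical key names (`title`, `inner`, `outer`), and the three helper methods are rough Ruby approximations of what `normalize-space`, `substring-after`, and `translate` (used to drop commas) compute on a string. The agent itself evaluates the XPath expressions and does not call these helpers.

```ruby
# Hypothetical extract options written as a Ruby hash. The key names and
# the "div.main" selector are illustrative; the "value" strings are the
# XPath expressions named in the description text.
EXTRACT_SKETCH = {
  "title" => { "css" => "div.main", "value" => "normalize-space(.)" },
  "inner" => { "css" => "div.main", "value" => "./node()" }, # inner HTML
  "outer" => { "css" => "div.main", "value" => "." }          # outer HTML
}

# Approximates XPath normalize-space: strips leading and trailing
# whitespace and squeezes internal runs of whitespace to one space.
def normalize_space(str)
  str.split(/[ \t\r\n]+/).reject(&:empty?).join(" ")
end

# Approximates XPath substring-after: the text following the first
# occurrence of sep, or an empty string when sep does not occur.
def substring_after(str, sep)
  str.partition(sep).last
end

# Approximates translate(., ",", ""): removes every comma.
def remove_commas(str)
  str.delete(",")
end
```

For example, `remove_commas(substring_after(normalize_space("  Price:  1,234 "), ": "))` mirrors chaining the three XPath functions on a node's string value, which is why the description recommends `normalize-space(.)` rather than passing a node set such as `.//text()`.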