@@ -16,16 +16,15 @@ module Agents
     description <<-MD
       The Website Agent scrapes a website, XML document, or JSON feed and creates Events based on the results.

-      Specify a `url` and select a `mode` for when to create Events based on the scraped data, either `all` or `on_change`.
+      Specify a `url` and select a `mode` for when to create Events based on the scraped data, either `all`, `on_change`, or `merge` (if fetching based on an Event, see below).

-      `url` can be a single url, or an array of urls (for example, for multiple pages with the exact same structure but different content to scrape)
+      The `url` option can be a single url, or an array of urls (for example, for multiple pages with the exact same structure but different content to scrape).

       The WebsiteAgent can also scrape based on incoming events.

-      * If the Event contains a `url` key, that URL will be fetched.
-      * For more control, you can set the `url_from_event` option and it will be used as a Liquid template to generate the url to access based on the Event.
-      * If you set `data_from_event` to a Liquid template, it will be used to generate the data directly without fetching any URL. (For example, set it to `{{ html }}` to use HTML contained in the `html` key of the incoming Event.)
-      * If you specify `merge` for the `mode` option, Huginn will retain the old payload and update it with the new values.
+      * Set the `url_from_event` option to a Liquid template to generate the url to access based on the Event. (To fetch the url in the Event's `url` key, for example, set `url_from_event` to `{{ url }}`.)
+      * Alternatively, set `data_from_event` to a Liquid template to use data directly without fetching any URL. (For example, set it to `{{ html }}` to use HTML contained in the `html` key of the incoming Event.)
+      * If you specify `merge` for the `mode` option, Huginn will retain the old payload and update it with new values.

       # Supported Document Types

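The new `url_from_event` behavior described in the hunk above can be sketched outside of Huginn. Huginn renders the option through the Liquid gem; the stand-in below implements only the `{{ key }}` substitution used in these examples, so no gem is required (`render_option` is an illustrative name, not Huginn's API):

```ruby
# Minimal stand-in for Liquid's {{ key }} substitution, for illustration only.
# It shows how `url_from_event` is resolved against an incoming Event's payload.
def render_option(template, payload)
  template.gsub(/\{\{\s*(\w+)\s*\}\}/) { payload[Regexp.last_match(1)].to_s }
end

event_payload = { 'url' => 'http://xkcd.com', 'html' => '<p>hi</p>' }

# To fetch the url in the Event's `url` key, set url_from_event to "{{ url }}":
puts render_option('{{ url }}', event_payload)
# => http://xkcd.com

# A template can also embed the value, e.g. as a query parameter:
puts render_option('http://example.org/?url={{ url }}', event_payload)
# => http://example.org/?url=http://xkcd.com
```

Real Liquid additionally supports filters such as `uri_escape`, which the specs further down use; this sketch omits them.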
@@ -37,7 +36,7 @@ module Agents

       # Scraping HTML and XML

-      When parsing HTML or XML, these sub-hashes specify how each extraction should be done. The Agent first selects a node set from the document for each extraction key by evaluating either a CSS selector in `css` or an XPath expression in `xpath`. It then evaluates an XPath expression in `value` (default: `.`) on each node in the node set, converting the result into string. Here's an example:
+      When parsing HTML or XML, these sub-hashes specify how each extraction should be done. The Agent first selects a node set from the document for each extraction key by evaluating either a CSS selector in `css` or an XPath expression in `xpath`. It then evaluates an XPath expression in `value` (default: `.`) on each node in the node set, converting the result into a string. Here's an example:

           "extract": {
             "url": { "css": "#comic img", "value": "@src" },
@@ -45,11 +44,11 @@ module Agents
             "body_text": { "css": "div.main", "value": ".//text()" }
           }

-      "@_attr_" is the XPath expression to extract the value of an attribute named _attr_ from a node, and ".//text()" is to extract all the enclosed texts. To extract the innerHTML, use "./node()"; and to extract the outer HTML, use ".".
+      "@_attr_" is the XPath expression to extract the value of an attribute named _attr_ from a node, and `.//text()` extracts all the enclosed text. To extract the innerHTML, use `./node()`; and to extract the outer HTML, use `.`.

-      You can also use [XPath functions](http://www.w3.org/TR/xpath/#section-String-Functions) like `normalize-space` to strip and squeeze whitespace, `substring-after` to extract part of a text, and `translate` to remove comma from a formatted number, etc. Note that these functions take a string, not a node set, so what you may think would be written as `normalize-space(.//text())` should actually be `normalize-space(.)`.
+      You can also use [XPath functions](http://www.w3.org/TR/xpath/#section-String-Functions) like `normalize-space` to strip and squeeze whitespace, `substring-after` to extract part of a text, and `translate` to remove commas from formatted numbers, etc. Note that these functions take a string, not a node set, so what you may think would be written as `normalize-space(.//text())` should actually be `normalize-space(.)`.

-      Beware that when parsing an XML document (i.e. `type` is `xml`) using `xpath` expressions all namespaces are stripped from the document unless a toplevel option `use_namespaces` is set to true.
+      Beware that when parsing an XML document (i.e. `type` is `xml`) using `xpath` expressions, all namespaces are stripped from the document unless the top-level option `use_namespaces` is set to `true`.

       # Scraping JSON

@@ -92,7 +91,7 @@ module Agents

       Set `uniqueness_look_back` to limit the number of events checked for uniqueness (typically for performance). This defaults to the larger of #{UNIQUENESS_LOOK_BACK} or #{UNIQUENESS_FACTOR}x the number of detected received results.

-      Set `force_encoding` to an encoding name if the website is known to respond with a missing, invalid or wrong charset in the Content-Type header. Note that a text content without a charset is taken as encoded in UTF-8 (not ISO-8859-1).
+      Set `force_encoding` to an encoding name if the website is known to respond with a missing, invalid, or wrong charset in the Content-Type header. Note that a text content without a charset is taken as encoded in UTF-8 (not ISO-8859-1).

       Set `user_agent` to a custom User-Agent name if the website does not like the default value (`#{default_user_agent}`).

@@ -343,7 +342,7 @@ module Agents
         if url_template = options['url_from_event'].presence
           interpolate_options(url_template)
         else
-          event.payload['url']
+          interpolated['url']
         end
       check_urls(url_to_scrape, existing_payload)
     end
@@ -0,0 +1,22 @@
+class WebsiteAgentDoesNotUseEventUrl < ActiveRecord::Migration
+  def up
+    # Until this migration, if a WebsiteAgent received Events and did not have a `url_from_event` option set,
+    # it would use the `url` from the Event's payload. If the Event did not have a `url` in its payload, the
+    # WebsiteAgent would do nothing. This migration assumes that if someone has wired a WebsiteAgent to receive Events
+    # and has not set `url_from_event` or `data_from_event`, they were trying to use the Event's `url` payload, so we
+    # set `url_from_event` to `{{ url }}` for them.
+    Agents::WebsiteAgent.find_each do |agent|
+      next unless agent.sources.count > 0
+
+      if !agent.options['data_from_event'].present? && !agent.options['url_from_event'].present?
+        agent.options['url_from_event'] = '{{ url }}'
+        agent.save!
+        puts ">> Setting `url_from_event` on WebsiteAgent##{agent.id} to {{ url }} because it is wired"
+        puts ">> to receive Events, and the WebsiteAgent no longer uses the Event's `url` value directly."
+      end
+    end
+  end
+
+  def down
+  end
+end
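The migration's effect on an agent's options can be sketched without Rails. Agents that have sources wired in but neither `data_from_event` nor `url_from_event` get `url_from_event` set to `{{ url }}`, preserving the old implicit behavior; all other agents are left alone. The `Struct` below is a stand-in for the ActiveRecord model:

```ruby
# Stand-in for Agents::WebsiteAgent: just the fields the migration touches.
Agent = Struct.new(:sources, :options)

def migrate!(agents)
  agents.each do |agent|
    next unless agent.sources.count > 0
    next if agent.options['data_from_event'] || agent.options['url_from_event']

    # Preserve the pre-migration behavior of reading the Event's `url`.
    agent.options['url_from_event'] = '{{ url }}'
  end
end

wired   = Agent.new([:some_source], { 'url' => 'http://example.com' })
unwired = Agent.new([],             { 'url' => 'http://example.com' })
migrate!([wired, unwired])

p wired.options['url_from_event']    # => "{{ url }}"
p unwired.options['url_from_event']  # => nil (no sources, so nothing changes)
```

The empty `down` in the real migration means this is a one-way change: there is no reliable way to know, afterwards, which agents had `url_from_event` set by hand.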
@@ -397,6 +397,8 @@ describe AgentsController do
     it "accepts an event" do
       sign_in users(:bob)
       agent = agents(:bob_website_agent)
+      agent.options['url_from_event'] = '{{ url }}'
+      agent.save!
       url_from_event = "http://xkcd.com/?from_event=1".freeze
       expect {
         post :dry_run, id: agent, event: { url: url_from_event }
@@ -768,20 +768,13 @@ fire: hot
       @event = Event.new
       @event.agent = agents(:bob_rain_notifier_agent)
       @event.payload = {
-        'url' => 'http://xkcd.com',
-        'link' => 'Random',
+        'url' => 'http://foo.com',
+        'link' => 'Random'
       }
     end

-    it "should scrape from the url element in incoming event payload" do
-      expect {
-        @checker.options = @valid_options
-        @checker.receive([@event])
-      }.to change { Event.count }.by(1)
-    end
-
-    it "should use url_from_event as url to scrape if it exists when receiving an event" do
-      stub = stub_request(:any, 'http://example.org/?url=http%3A%2F%2Fxkcd.com')
+    it "should use url_from_event as the url to scrape" do
+      stub = stub_request(:any, 'http://example.org/?url=http%3A%2F%2Ffoo.com')

       @checker.options = @valid_options.merge(
         'url_from_event' => 'http://example.org/?url={{url | uri_escape}}'
@@ -791,9 +784,16 @@ fire: hot
       expect(stub).to have_been_requested
     end

+    it "should use the Agent's `url` option if url_from_event is not set" do
+      expect {
+        @checker.options = @valid_options
+        @checker.receive([@event])
+      }.to change { Event.count }.by(1)
+    end
+
     it "should allow url_from_event to be an array of urls" do
-      stub1 = stub_request(:any, 'http://example.org/?url=http%3A%2F%2Fxkcd.com')
-      stub2 = stub_request(:any, 'http://google.org/?url=http%3A%2F%2Fxkcd.com')
+      stub1 = stub_request(:any, 'http://example.org/?url=http%3A%2F%2Ffoo.com')
+      stub2 = stub_request(:any, 'http://google.org/?url=http%3A%2F%2Ffoo.com')

       @checker.options = @valid_options.merge(
         'url_from_event' => ['http://example.org/?url={{url | uri_escape}}', 'http://google.org/?url={{url | uri_escape}}']
@@ -805,7 +805,10 @@ fire: hot
     end

     it "should interpolate values from incoming event payload" do
+      stub_request(:any, /foo/).to_return(body: File.read(Rails.root.join("spec/data_fixtures/xkcd.html")), status: 200)
+
       expect {
+        @valid_options['url_from_event'] = '{{ url }}'
         @valid_options['extract'] = {
           'from' => {
             'xpath' => '*[1]',
@@ -821,11 +824,21 @@ fire: hot
       }.to change { Event.count }.by(1)

       expect(Event.last.payload).to eq({
-        'from' => 'http://xkcd.com',
+        'from' => 'http://foo.com',
         'to' => 'http://dynamic.xkcd.com/random/comic/',
       })
     end

+    it "should use the options url if no url is in the event payload, and `url_from_event` is not provided" do
+      @checker.options['mode'] = 'merge'
+      @event.payload.delete('url')
+      expect {
+        @checker.receive([@event])
+      }.to change { Event.count }.by(1)
+      expect(Event.last.payload['title']).to eq('Evolving')
+      expect(Event.last.payload['link']).to eq('Random')
+    end
+
     it "should interpolate values from incoming event payload and _response_" do
       @event.payload['title'] = 'XKCD'

@@ -1065,7 +1078,6 @@ fire: hot
       event = @events[6]
       expect(event.payload['url']).to eq("https://www.google.ca/search?q=%EC%9C%84%ED%82%A4%EB%B0%B1%EA%B3%BC:%EB%8C%80%EB%AC%B8")
     end
-
     end
   end
 end