{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Housing Prices in São Paulo\n", "\n", "This notebook gathers data on the prices and sizes of homes for sale in the city of São Paulo, Brazil." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Import Libraries\n", "import pandas as pd\n", "import requests\n", "from bs4 import BeautifulSoup\n", "import matplotlib.pyplot as plt\n", "import matplotlib\n", "matplotlib.style.use('ggplot')\n", "%matplotlib notebook" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data comes from [Imovel Web](http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-1.html), a Brazilian online real estate portal. The function below builds the URL of a results page from its page number; the main loop calls it once per iteration." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true, "scrolled": true }, "outputs": [], "source": [ "# Build the URL of the results page with the given number\n", "def getURL(page_number):\n", "    base_url = \"http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-\"\n", "    end_url = \".html\"\n", "    url = base_url + str(page_number) + end_url\n", "    return url" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Convert a numeric string to int when possible, otherwise to float\n", "def num(s):\n", "    try:\n", "        return int(s)\n", "    except ValueError:\n", "        return float(s)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The function below requests a URL and passes the page to [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/), which extracts the price and size of each listing on that page. Each listing is then written as a new row of a [pandas](https://pandas.pydata.org) DataFrame for later use." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Scrape one results page into the global DataFrame df; i is the next row index\n", "def grab_data(url, i):\n", "    try:\n", "        result = requests.get(url)\n", "        page = BeautifulSoup(result.content, \"html5lib\")\n", "        items = page.find_all('li', class_='post')\n", "        for item in items:\n", "            title = item.find(\"a\", class_='dl-aviso-link').get('title')\n", "            # Strip the currency symbol and the '.' thousands separators\n", "            price = item.find(\"span\", class_='precio-valor').string.replace(\"R$\",\"\").replace(\".\",\"\").strip()\n", "            size = item.find(\"li\", class_='post-m2totales')\n", "            # Listings without a total area are skipped\n", "            if size is not None:\n", "                size = size.text.replace(\"total\",\"\").strip()\n", "                #print(size + \" - \" + price + \" - \" + title)\n", "                price = num(str(price))/1000  # price in thousands of R$\n", "                size = num(str(size.replace(\"m²\",\"\")))  # area in m²\n", "                df.loc[i] = [size, price]\n", "                i = i + 1\n", "        return i\n", "    except Exception:\n", "        # Any request or parsing failure abandons the rest of this page\n", "        print(\"--> ERROR\")\n", "        return i" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below is the main program loop. It grabs data from *n* pages using the `grab_data()` function. The program prints each URL before scraping it and prints `--> ERROR` if that page fails. If an error occurs, the program moves on to the next page." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-1.html\n", "2 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-2.html\n", "3 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-3.html\n", "4 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-4.html\n", "5 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-5.html\n", "6 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-6.html\n", "7 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-7.html\n", "8 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-8.html\n", "9 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-9.html\n", "10 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-10.html\n", "11 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-11.html\n", "12 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-12.html\n", "13 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-13.html\n", "14 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-14.html\n", "15 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-15.html\n", "16 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-16.html\n", "--> ERROR\n", "17 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-17.html\n", "18 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-18.html\n", "19 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-19.html\n", "20 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-20.html\n", "21 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-21.html\n", "22 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-22.html\n", "23 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-23.html\n", "24 - 
http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-24.html\n", "25 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-25.html\n", "26 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-26.html\n", "27 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-27.html\n", "28 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-28.html\n", "29 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-29.html\n", "--> ERROR\n", "30 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-30.html\n", "31 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-31.html\n", "32 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-32.html\n", "33 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-33.html\n", "34 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-34.html\n", "35 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-35.html\n", "36 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-36.html\n", "37 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-37.html\n", "38 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-38.html\n", "39 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-39.html\n", "40 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-40.html\n", "41 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-41.html\n", "42 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-42.html\n", "43 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-43.html\n", "44 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-44.html\n", "45 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-45.html\n", "--> ERROR\n", "46 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-46.html\n", "47 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-47.html\n", "48 - 
http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-48.html\n", "49 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-49.html\n", "50 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-50.html\n", "51 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-51.html\n", "52 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-52.html\n", "53 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-53.html\n", "54 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-54.html\n", "55 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-55.html\n", "56 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-56.html\n", "57 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-57.html\n", "58 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-58.html\n", "59 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-59.html\n", "60 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-60.html\n", "61 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-61.html\n", "62 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-62.html\n", "63 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-63.html\n", "--> ERROR\n", "64 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-64.html\n", "65 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-65.html\n", "66 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-66.html\n", "67 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-67.html\n", "68 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-68.html\n", "69 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-69.html\n", "70 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-70.html\n", "71 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-71.html\n", "72 - 
http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-72.html\n", "73 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-73.html\n", "74 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-74.html\n", "75 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-75.html\n", "76 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-76.html\n", "77 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-77.html\n", "78 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-78.html\n", "79 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-79.html\n", "80 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-80.html\n", "81 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-81.html\n", "82 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-82.html\n", "83 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-83.html\n", "84 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-84.html\n", "85 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-85.html\n", "86 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-86.html\n", "--> ERROR\n", "87 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-87.html\n", "88 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-88.html\n", "--> ERROR\n", "89 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-89.html\n", "90 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-90.html\n", "91 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-91.html\n", "--> ERROR\n", "92 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-92.html\n", "93 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-93.html\n", "94 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-94.html\n", "95 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-95.html\n", "96 - 
http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-96.html\n", "97 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-97.html\n", "98 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-98.html\n", "99 - http://www.imovelweb.com.br/imoveis-venda-sao-paulo-sp-pagina-99.html\n" ] }, { "data": { "text/html": [ "
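<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>size</th>\n", "      <th>price</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr>\n", "      <th>1084</th>\n", "      <td>69.0</td>\n", "      <td>480.0</td>\n", "    </tr>\n", "    <tr>\n", "      <th>1085</th>\n", "      <td>103.0</td>\n", "      <td>945.0</td>\n", "    </tr>\n", "    <tr>\n", "      <th>1086</th>\n", "      <td>56.0</td>\n", "      <td>650.0</td>\n", "    </tr>\n", "    <tr>\n", "      <th>1087</th>\n", "      <td>81.0</td>\n", "      <td>800.0</td>\n", "    </tr>\n", "    <tr>\n", "      <th>1088</th>\n", "      <td>70.0</td>\n", "      <td>35.0</td>\n", "    </tr>\n", "  </tbody>\n", "</table>\n", "</div>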
" ], "text/plain": [ " size price\n", "1084 69.0 480.0\n", "1085 103.0 945.0\n", "1086 56.0 650.0\n", "1087 81.0 800.0\n", "1088 70.0 35.0" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame([], columns=('size', 'price'))\n", "i = 0\n", "for page_number in range(1,100):\n", " url = getURL(page_number)\n", " print(str(page_number) + \" - \" + url)\n", " i = grab_data(url, i)\n", "df.tail() " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next snippet creates a plot with the data gathered in the previous step." ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "scrolled": false }, "outputs": [ { "data": { "application/javascript": [ "/* Put everything inside the global mpl namespace */\n", "window.mpl = {};\n", "\n", "\n", "mpl.get_websocket_type = function() {\n", " if (typeof(WebSocket) !== 'undefined') {\n", " return WebSocket;\n", " } else if (typeof(MozWebSocket) !== 'undefined') {\n", " return MozWebSocket;\n", " } else {\n", " alert('Your browser does not have WebSocket support.' +\n", " 'Please try Chrome, Safari or Firefox ≥ 6. ' +\n", " 'Firefox 4 and 5 are also supported but you ' +\n", " 'have to enable WebSockets in about:config.');\n", " };\n", "}\n", "\n", "mpl.figure = function(figure_id, websocket, ondownload, parent_element) {\n", " this.id = figure_id;\n", "\n", " this.ws = websocket;\n", "\n", " this.supports_binary = (this.ws.binaryType != undefined);\n", "\n", " if (!this.supports_binary) {\n", " var warnings = document.getElementById(\"mpl-warnings\");\n", " if (warnings) {\n", " warnings.style.display = 'block';\n", " warnings.textContent = (\n", " \"This browser does not support binary websocket messages. 
\" +\n", " \"Performance may be slow.\");\n", " }\n", " }\n", "\n", " this.imageObj = new Image();\n", "\n", " this.context = undefined;\n", " this.message = undefined;\n", " this.canvas = undefined;\n", " this.rubberband_canvas = undefined;\n", " this.rubberband_context = undefined;\n", " this.format_dropdown = undefined;\n", "\n", " this.image_mode = 'full';\n", "\n", " this.root = $('
');\n", " this._root_extra_style(this.root)\n", " this.root.attr('style', 'display: inline-block');\n", "\n", " $(parent_element).append(this.root);\n", "\n", " this._init_header(this);\n", " this._init_canvas(this);\n", " this._init_toolbar(this);\n", "\n", " var fig = this;\n", "\n", " this.waiting = false;\n", "\n", " this.ws.onopen = function () {\n", " fig.send_message(\"supports_binary\", {value: fig.supports_binary});\n", " fig.send_message(\"send_image_mode\", {});\n", " if (mpl.ratio != 1) {\n", " fig.send_message(\"set_dpi_ratio\", {'dpi_ratio': mpl.ratio});\n", " }\n", " fig.send_message(\"refresh\", {});\n", " }\n", "\n", " this.imageObj.onload = function() {\n", " if (fig.image_mode == 'full') {\n", " // Full images could contain transparency (where diff images\n", " // almost always do), so we need to clear the canvas so that\n", " // there is no ghosting.\n", " fig.context.clearRect(0, 0, fig.canvas.width, fig.canvas.height);\n", " }\n", " fig.context.drawImage(fig.imageObj, 0, 0);\n", " };\n", "\n", " this.imageObj.onunload = function() {\n", " this.ws.close();\n", " }\n", "\n", " this.ws.onmessage = this._make_on_message_function(this);\n", "\n", " this.ondownload = ondownload;\n", "}\n", "\n", "mpl.figure.prototype._init_header = function() {\n", " var titlebar = $(\n", " '
');\n", " var titletext = $(\n", " '
');\n", " titlebar.append(titletext)\n", " this.root.append(titlebar);\n", " this.header = titletext[0];\n", "}\n", "\n", "\n", "\n", "mpl.figure.prototype._canvas_extra_style = function(canvas_div) {\n", "\n", "}\n", "\n", "\n", "mpl.figure.prototype._root_extra_style = function(canvas_div) {\n", "\n", "}\n", "\n", "mpl.figure.prototype._init_canvas = function() {\n", " var fig = this;\n", "\n", " var canvas_div = $('
');\n", "\n", " canvas_div.attr('style', 'position: relative; clear: both; outline: 0');\n", "\n", " function canvas_keyboard_event(event) {\n", " return fig.key_event(event, event['data']);\n", " }\n", "\n", " canvas_div.keydown('key_press', canvas_keyboard_event);\n", " canvas_div.keyup('key_release', canvas_keyboard_event);\n", " this.canvas_div = canvas_div\n", " this._canvas_extra_style(canvas_div)\n", " this.root.append(canvas_div);\n", "\n", " var canvas = $('');\n", " canvas.addClass('mpl-canvas');\n", " canvas.attr('style', \"left: 0; top: 0; z-index: 0; outline: 0\")\n", "\n", " this.canvas = canvas[0];\n", " this.context = canvas[0].getContext(\"2d\");\n", "\n", " var backingStore = this.context.backingStorePixelRatio ||\n", "\tthis.context.webkitBackingStorePixelRatio ||\n", "\tthis.context.mozBackingStorePixelRatio ||\n", "\tthis.context.msBackingStorePixelRatio ||\n", "\tthis.context.oBackingStorePixelRatio ||\n", "\tthis.context.backingStorePixelRatio || 1;\n", "\n", " mpl.ratio = (window.devicePixelRatio || 1) / backingStore;\n", "\n", " var rubberband = $('');\n", " rubberband.attr('style', \"position: absolute; left: 0; top: 0; z-index: 1;\")\n", "\n", " var pass_mouse_events = true;\n", "\n", " canvas_div.resizable({\n", " start: function(event, ui) {\n", " pass_mouse_events = false;\n", " },\n", " resize: function(event, ui) {\n", " fig.request_resize(ui.size.width, ui.size.height);\n", " },\n", " stop: function(event, ui) {\n", " pass_mouse_events = true;\n", " fig.request_resize(ui.size.width, ui.size.height);\n", " },\n", " });\n", "\n", " function mouse_event_fn(event) {\n", " if (pass_mouse_events)\n", " return fig.mouse_event(event, event['data']);\n", " }\n", "\n", " rubberband.mousedown('button_press', mouse_event_fn);\n", " rubberband.mouseup('button_release', mouse_event_fn);\n", " // Throttle sequential mouse events to 1 every 20ms.\n", " rubberband.mousemove('motion_notify', mouse_event_fn);\n", "\n", " 
rubberband.mouseenter('figure_enter', mouse_event_fn);\n", " rubberband.mouseleave('figure_leave', mouse_event_fn);\n", "\n", " canvas_div.on(\"wheel\", function (event) {\n", " event = event.originalEvent;\n", " event['data'] = 'scroll'\n", " if (event.deltaY < 0) {\n", " event.step = 1;\n", " } else {\n", " event.step = -1;\n", " }\n", " mouse_event_fn(event);\n", " });\n", "\n", " canvas_div.append(canvas);\n", " canvas_div.append(rubberband);\n", "\n", " this.rubberband = rubberband;\n", " this.rubberband_canvas = rubberband[0];\n", " this.rubberband_context = rubberband[0].getContext(\"2d\");\n", " this.rubberband_context.strokeStyle = \"#000000\";\n", "\n", " this._resize_canvas = function(width, height) {\n", " // Keep the size of the canvas, canvas container, and rubber band\n", " // canvas in synch.\n", " canvas_div.css('width', width)\n", " canvas_div.css('height', height)\n", "\n", " canvas.attr('width', width * mpl.ratio);\n", " canvas.attr('height', height * mpl.ratio);\n", " canvas.attr('style', 'width: ' + width + 'px; height: ' + height + 'px;');\n", "\n", " rubberband.attr('width', width);\n", " rubberband.attr('height', height);\n", " }\n", "\n", " // Set the figure to an initial 600x600px, this will subsequently be updated\n", " // upon first draw.\n", " this._resize_canvas(600, 600);\n", "\n", " // Disable right mouse context menu.\n", " $(this.rubberband_canvas).bind(\"contextmenu\",function(e){\n", " return false;\n", " });\n", "\n", " function set_focus () {\n", " canvas.focus();\n", " canvas_div.focus();\n", " }\n", "\n", " window.setTimeout(set_focus, 100);\n", "}\n", "\n", "mpl.figure.prototype._init_toolbar = function() {\n", " var fig = this;\n", "\n", " var nav_element = $('
')\n", " nav_element.attr('style', 'width: 100%');\n", " this.root.append(nav_element);\n", "\n", " // Define a callback function for later on.\n", " function toolbar_event(event) {\n", " return fig.toolbar_button_onclick(event['data']);\n", " }\n", " function toolbar_mouse_event(event) {\n", " return fig.toolbar_button_onmouseover(event['data']);\n", " }\n", "\n", " for(var toolbar_ind in mpl.toolbar_items) {\n", " var name = mpl.toolbar_items[toolbar_ind][0];\n", " var tooltip = mpl.toolbar_items[toolbar_ind][1];\n", " var image = mpl.toolbar_items[toolbar_ind][2];\n", " var method_name = mpl.toolbar_items[toolbar_ind][3];\n", "\n", " if (!name) {\n", " // put a spacer in here.\n", " continue;\n", " }\n", " var button = $('