{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Recitation 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Common Piazza questions and writeup updates" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "See the Piazza post: https://piazza.com/class/jcizpany5u6522?cid=92\n", "\n", "\n", "**Writeup typos:**\n", "\n", "* In scraper Q1, the number of Pittsburgh Yelp businesses should be 2900, not 13400.\n", "* In scraper Q2, the number of restaurants in Polish Hill should be 42, not 41.\n", "* In scraper Q2, `all_restaurants` should return a list of dictionaries, not a list of strings. The `print(data)` statement in the reference output should have been `print(list(map(lambda x: x['name'], data)))`.\n", "* In XML parser Q1, `html_prolog` should be `html_declaration`.\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Helpful homework information\n", "\n", "* Start early!\n", "* The first question is misleadingly easy (20 min max).\n", "* Learn how to work your way through documentation.\n", "* Write your own tests!" 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Demo: how to submit to Autolab\n", "* Run `make`.\n", "* Upload the tarball to Autolab." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## More helpful Jupyter notebook commands" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!ls # prefix a line with ! to run a shell command in a code cell" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In command mode:\n", "\n", "* Y to convert a cell to code\n", "* M to convert a cell to markdown\n", "* L to toggle line numbers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Web scraping" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Requests library\n", "\n", "http://docs.python-requests.org/en/master/\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import requests" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Parameters" ] }, { "cell_type": "code", "execution_count": 360, "metadata": { "scrolled": true }, "outputs": [], "source": [ "url = \"http://www.google.com/search\"\n", "params = {\n", " \"query\": \"python metaclass\",\n", " \"source\": \"chrome\"\n", "} # params is a Python dict; requests URL-encodes it into the query string\n", "response = requests.get(url, params=params)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Useful attributes" ] }, { "cell_type": "code", "execution_count": 361, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'http://www.google.com/search?query=python+metaclass&source=chrome'" ] }, "execution_count": 361, "metadata": {}, "output_type": "execute_result" } ], "source": [ "response.url # the full URL that was requested, including the encoded query string" ] }, { "cell_type": "code", "execution_count": 362, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "200" ] }, "execution_count": 362, "metadata": {}, "output_type": "execute_result" } ], "source": [ "response.status_code # status code of 200 indicates success. 
Other 2xx codes also indicate success; 4xx and 5xx codes indicate client or server errors" ] }, { "cell_type": "code", "execution_count": 351, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "'python metaclass - Google Search