
Recitation 1

Homework Tips

Autograder

TA Hours

Environment Setup

Python environments take a little practice to get exactly right. It is very easy to spectacularly mess it up, as Randall Munroe illustrates on xkcd:

We encourage you to use Vagrant because it sets up a clean, repeatable environment in a virtual machine. You may also wish to use Anaconda (a popular all-in-one solution), or Windows Subsystem for Linux (run Ubuntu/Fedora/Debian as a Windows app).

Also, never run sudo pip install .... Instead, try pip install --user ...

Running sudo pip install ... gives random people on the internet, and me, root access to your machine. That is a Very Bad Idea.

Writing Tests

Writing good tests requires that you understand the problem domain and what can go wrong. Typically, you learn what can go wrong by making those mistakes over and over.

Approach

When writing your tests, there are two approaches:

We give you both kinds of tests, and our grader runs both on your code, but we generally weight top-down tests much more heavily. If you cannot pass a test, an unknown unknown is likely at play, and you should try your code on more tests.

Complexity

Also, you don’t write a single test so much as a suite of tests. Rather than writing only large tests, begin with the smallest possible test that checks a feature, then copy it and add complexity. If you discover a bug, the tests that still pass tell you a lot about what must be causing it.
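As a minimal sketch of growing a suite from the smallest case upward, here is a unittest class for a hypothetical count_words function (the function and its tests are illustrative assumptions, not part of any homework):

```python
import unittest

def count_words(text):
    """Hypothetical function under test: count whitespace-separated words."""
    return len(text.split())

class CountWordsTests(unittest.TestCase):
    def test_empty_string(self):
        # Smallest possible case: no input at all.
        self.assertEqual(count_words(""), 0)

    def test_single_word(self):
        # Copy of the test above with one feature added.
        self.assertEqual(count_words("hello"), 1)

    def test_two_words(self):
        self.assertEqual(count_words("hello world"), 2)

    def test_extra_whitespace(self):
        # Most complex case: leading, trailing, and mixed whitespace.
        self.assertEqual(count_words("  hello \n\t world  "), 2)
```

If only test_extra_whitespace fails, the passing tests already tell you that basic counting works and the problem is in whitespace handling.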

Regression

Write regression tests. When you discover a bug:

  1. write a test that should fail
  2. check that the test does fail
  3. fix the bug
  4. check that the test now passes

This way you will catch similar bugs in the future.
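The four steps can be sketched as follows. The mean function and its empty-list bug are hypothetical, chosen only to illustrate the workflow:

```python
def mean(values):
    """Hypothetical function that originally crashed on an empty list."""
    if not values:  # the fix (step 3); before it, the line below divided by zero
        return 0.0
    return sum(values) / len(values)

def test_mean_of_empty_list_regression():
    # Steps 1, 2, and 4: written when the bug was found, seen to fail
    # before the fix, and kept in the suite after it passes.
    assert mean([]) == 0.0

def test_mean_basic():
    # An older test that must keep passing after the fix.
    assert mean([1, 2, 3]) == 2.0
```

Naming the test after the bug it guards against makes it easier to see why it exists months later.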

For hw1_xml_parser, we have a thread where you can share your tests; regression tests are especially valuable because if you made a mistake it is likely that others will as well.

Scraping with requests

Read the documentation.

import requests

What is HTTP?

A protocol for requesting (and receiving) data from servers using human-readable text.

When you navigate to www.example.com, here’s what your browser sends as plain text:

GET / HTTP/1.1
Host: www.example.com
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0
Accept: text/html
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate

The request begins with the method (GET), the path (/), and the protocol version (HTTP/1.1). The browser then identifies itself and advertises its capabilities.

The server sends back:

HTTP/1.1 200 OK
Content-Encoding: gzip
Content-Type: text/html; charset=UTF-8
Last-Modified: Fri, 09 Aug 2013 23:54:35 GMT
Server: ECS (phd/FD6D)
Content-Length: 606

<!doctype html>
<html>
...

The response begins with the status line: the protocol version (HTTP/1.1) and a status code saying that the request succeeded (200 OK). The headers that follow give you information about the response and the server. Finally, a blank line separates the headers from the response body.
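To make the layout concrete, here is a rough sketch that splits a raw response into its parts. The parse_http_response helper is a hypothetical illustration; it assumes an uncompressed text body and ignores many details that a real HTTP client handles:

```python
def parse_http_response(raw):
    """Split a raw HTTP/1.x response into (version, status, reason, headers, body)."""
    # The first blank line separates the status line and headers from the body.
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    # Status line: protocol version, numeric status code, reason phrase.
    version, status, reason = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return version, int(status), reason, headers, body

raw = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html; charset=UTF-8\r\n"
    "Content-Length: 15\r\n"
    "\r\n"
    "<!doctype html>"
)
version, status, reason, headers, body = parse_http_response(raw)
print(version, status, reason)   # HTTP/1.1 200 OK
print(headers["Content-Type"])   # text/html; charset=UTF-8
print(body)                      # <!doctype html>
```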

requests is a library that handles the heavy lifting of this process.

Using Requests

response = requests.get("http://www.google.com/search",
                        params={ "query": "python metaclass", "source":"chrome" })

We send request headers that look something like this:

GET /search?query=python+metaclass&source=chrome HTTP/1.1
Host: www.google.com
Accept-Encoding: gzip, deflate, compress
Accept: */*
User-Agent: python-requests/2.1.0 CPython/3.6.7 Linux/3.2.0-23-generic-pae

Note how the GET parameters are appended to the requested path as a query string, and how they are encoded to replace characters (such as spaces) that are not allowed in a URL. requests handles this for you automatically.
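You can see the encoding without sending anything over the network. This sketch uses the standard library’s urllib.parse rather than requests itself, on the assumption that the two produce the same query string for simple parameters like these:

```python
from urllib.parse import urlencode, urlsplit, parse_qs

params = {"query": "python metaclass", "source": "chrome"}

# Encode the parameters; the space becomes "+" in a query string.
query_string = urlencode(params)
url = "http://www.google.com/search?" + query_string
print(url)  # http://www.google.com/search?query=python+metaclass&source=chrome

# Decoding reverses the process and recovers the original values.
parts = urlsplit(url)
decoded = parse_qs(parts.query)
print(parts.path)  # /search
print(decoded)     # {'query': ['python metaclass'], 'source': ['chrome']}
```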

Let’s examine the response we get. We can read the response URL and the status code as attributes:

response.url
response.status_code # `200` means the request was successful
from pprint import pprint # Pretty-printer
pprint(dict(response.headers))
{'Cache-Control': 'private, max-age=0',
 'Content-Encoding': 'gzip',
 'Content-Type': 'text/html; charset=ISO-8859-1',
 'Date': 'Thu, 05 Sep 2019 03:58:08 GMT',
 'Expires': '-1',
 'P3P': 'CP="This is not a P3P policy! See g.co/p3phelp for more info."',
 'Server': 'gws',
 'Set-Cookie': '1P_JAR=2019-09-05-03; expires=Sat, 05-Oct-2019 03:58:08 GMT; '
               'path=/; domain=.google.com; SameSite=none, CGIC=IgMqLyo; '
               'expires=Tue, 03-Mar-2020 03:58:08 GMT; path=/complete/search; '
               'domain=.google.com; HttpOnly, CGIC=IgMqLyo; expires=Tue, '
               '03-Mar-2020 03:58:08 GMT; path=/search; domain=.google.com; '
               'HttpOnly, '
               'NID=188=kT-c41uxIh8UsAuyXVBzg6CukUgfuwceNA2AyFvaosgsM_B0X9lps4MwweFhF2VwzbEOoX_hD6e4tHRKFfuKF-dNO0u0Rl1RUiUCILdSw9aQvnWdOXPlKg0ZeG-sfiMkPwX3YKhgx7XMTXMBpye1rZgNWkcbCA2ZKjEAmYjpoOQ; '
               'expires=Fri, 06-Mar-2020 03:58:08 GMT; path=/; '
               'domain=.google.com; HttpOnly',
 'Transfer-Encoding': 'chunked',
 'X-Frame-Options': 'SAMEORIGIN',
 'X-XSS-Protection': '0'}

Response body and encoding.

(type(response.text), response.text[:100])
(type(response.content), response.content[:100])

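Notice that the two attributes return different types: .text is a str that requests has already decoded for you, while .content is the raw bytes exactly as the server sent them.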

Content-type conflicts

Notice there’s a conflict between the encoding declared in the Content-Type header and the one declared in the HTML itself:

response.headers["Content-Type"]
response.text[37:59]

This happens very, very often. BeautifulSoup4 and many other libraries accept the bytes directly and automatically figure out the encoding, so it is usually safer to give them response.content than response.text. (This was the autograder bug.) Here’s what happens when you use the wrong encoding:

print("Bytes:")
print(b'\xf0\x9f\x92\xa9')
print("Using the Content-Type encoding: [ISO-8859-1]")
print(b'\xf0\x9f\x92\xa9'.decode("ISO-8859-1"))
print("Using the <meta> tag encoding: [UTF-8]")
print(b'\xf0\x9f\x92\xa9'.decode("UTF-8"))

Bytes:
b'\xf0\x9f\x92\xa9'
Using the Content-Type encoding: [ISO-8859-1]
ð©
Using the <meta> tag encoding: [UTF-8]
💩
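One way to handle the conflict by hand is to look for the charset declared inside the document and decode the raw bytes with it. The sniff_meta_charset helper below is a hypothetical, heavily simplified sketch; libraries such as BeautifulSoup4 do this detection far more thoroughly:

```python
import re

def sniff_meta_charset(raw_bytes, default="utf-8"):
    """Look for a charset declaration in the first 1024 bytes of an HTML document."""
    m = re.search(rb'charset=["\']?([\w-]+)', raw_bytes[:1024])
    return m.group(1).decode("ascii") if m else default

html = '<!doctype html><html><head><meta charset="UTF-8"></head><body>💩</body></html>'
raw = html.encode("utf-8")        # stands in for response.content

header_charset = "ISO-8859-1"     # what a mismatched Content-Type header might claim
meta_charset = sniff_meta_charset(raw)

wrong = raw.decode(header_charset)  # decodes without error, but yields mojibake
right = raw.decode(meta_charset)    # recovers the original text

print(meta_charset)      # UTF-8
print("💩" in wrong)     # False
print("💩" in right)     # True
```

Note that decoding with ISO-8859-1 never raises an exception, since every byte maps to some character, which is why this kind of bug tends to go unnoticed.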

Regular Expressions

Before you use regular expressions, read this. If you’re interested in practicing with RegExes, you can read about Regex Golf and then try it yourself.

Regular expressions are a way to find or extract text from strings. For this tutorial, you should keep open one of these cheat sheets: MIT, RegexLib and an online RegEx tester (recommended).

There are many flavors of RegExps; they handle the basics the same way but have subtle differences around backreferences.

We’ll set up a testing function and use it to run re.search(...) on some regular expressions and examples.

import re
from IPython.display import display, Markdown, Latex

# FEEL FREE TO IGNORE THIS CODE

def _match(regex, example, search=re.search):
    m = search(regex, example)
    if m:
        st, en = m.span()
        return f"{example[:st]}<u>{example[st:en]}</u>{example[en:]}"
    else:
        return f"<s>{example}</s>"

def pm(examples, regexes, search=re.search):
    # Accept either a single example/regex or a list of them
    examples = examples if isinstance(examples, list) else [examples]
    regexes = regexes if isinstance(regexes, list) else [regexes]
    for regex in regexes:
        display(Markdown(f"**re.{search.__name__} {regex} :** " + ", ".join(_match(regex, example, search) for example in examples)))

def pmg(examples, regex, match=re.match):
    examples = examples if isinstance(examples, list) else [examples]
    for example in examples:
        a = match(regex, example)
        # A failed match returns None, which has no .groups()
        groups = a.groups() if a else ()
        display(Markdown(f"**re.{match.__name__}(..., {example}) :** " + ", ".join(f"'{m}'" for m in groups)))

Let’s begin with a warmup. Here’s how you match one character from a set:

pm(["bat", "bit", "bot", "but", "batty", "bitty", "and", "or", "not"],
   "b[aeiou]t")

And here’s how you match distinct options. Note how the a in batty is matched rather than the tt: re.search reports the leftmost match in the string, whichever order the alternatives are written in.

pm(["bat", "bit", "bot", "but", "batty", "bitty", "and", "or", "not"],
   ["(a|tt)", "(tt|a)"])

re.search looks for a match starting at any position, re.match only matches at the beginning of the string, and re.fullmatch requires the whole string to match. (There are other options, like re.findall.)

pm(["bot", "talbot", "botobot"], ["bot"], re.search)
pm(["bot", "talbot", "botobot"], ["bot"], re.match)
pm(["bot", "talbot", "botobot"], ["bot"], re.fullmatch)
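Without the display helper, the difference between the three functions can be checked directly. This sketch only lists which of the example strings each function accepts:

```python
import re

words = ["bot", "talbot", "botobot"]

accepted = {}
for fn in (re.search, re.match, re.fullmatch):
    # Each function returns a match object on success and None on failure.
    accepted[fn.__name__] = [w for w in words if fn("bot", w)]
    print(f"re.{fn.__name__}: {accepted[fn.__name__]}")

# re.search: ['bot', 'talbot', 'botobot']
# re.match: ['bot', 'botobot']
# re.fullmatch: ['bot']
```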

The number of occurrences to match can be controlled with quantifiers:

pm(["bt", "bat", "baat", "baaat", "baaaat"],
   ["bat", "ba?t", "ba+t", "ba*t", "ba{2,3}t"])
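For reference, this sketch lists which of the examples each pattern accepts in full, using re.fullmatch so that partial matches do not count:

```python
import re

examples = ["bt", "bat", "baat", "baaat", "baaaat"]
patterns = ["bat", "ba?t", "ba+t", "ba*t", "ba{2,3}t"]

# For each pattern, collect the examples it matches from start to end.
table = {p: [e for e in examples if re.fullmatch(p, e)] for p in patterns}
for pattern, matched in table.items():
    print(f"{pattern:10} {matched}")

# bat        ['bat']                               exactly one a
# ba?t       ['bt', 'bat']                         zero or one a
# ba+t       ['bat', 'baat', 'baaat', 'baaaat']    one or more a
# ba*t       ['bt', 'bat', 'baat', 'baaat', 'baaaat']   zero or more a
# ba{2,3}t   ['baat', 'baaat']                     two or three a
```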

There is plenty more syntax:

There are more details; consult the documentation. Also, note that Python 3’s re module has only partial Unicode support: str patterns are Unicode-aware, but Unicode property classes such as \p{L} are not supported.

Capture Groups

This is how we extract parts of structured data:

pmg(["10:00 am", "10:20 am", "3:11 pm"],
    r"((\d\d?):(\d\d)) (am|pm)")
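Once the groups are captured, they can be used like any other strings. As a small sketch, the hypothetical to_24h helper below reuses the same pattern to convert a time to 24-hour form:

```python
import re

TIME_RE = re.compile(r"((\d\d?):(\d\d)) (am|pm)")

def to_24h(text):
    """Return (hour, minute) on a 24-hour clock, or None if the text does not match."""
    m = TIME_RE.match(text)
    if m is None:
        return None
    # Groups are numbered by their opening parenthesis, left to right.
    _, hour, minute, meridiem = m.groups()
    hour = int(hour) % 12 + (12 if meridiem == "pm" else 0)
    return hour, int(minute)

print(TIME_RE.match("3:11 pm").groups())  # ('3:11', '3', '11', 'pm')
print(to_24h("3:11 pm"))                  # (15, 11)
print(to_24h("10:00 am"))                 # (10, 0)
print(to_24h("noon"))                     # None
```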

Greedy Matching

RegEx quantifiers (*, +, and {m,n}) match the longest possible string by default. This is called greedy matching. For example:

pm(["(1 + 2) * (3 + 4)"],
   [r"\(.*\)", r"\(.+\)", r"\(.{2,}\)"])

You can make them match the shortest possible string instead by appending ? to the quantifier. This is called lazy matching.

pm(["(1 + 2) * (3 + 4)"],
   [r"\(.*?\)", r"\(.+?\)", r"\(.{2,}?\)"])
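The difference is easiest to see with re.findall, which returns every non-overlapping match. The third pattern in this sketch is a common alternative that is not lazy at all; it simply refuses to cross a closing parenthesis:

```python
import re

expr = "(1 + 2) * (3 + 4)"

greedy = re.findall(r"\(.*\)", expr)      # runs to the last ")" in the string
lazy = re.findall(r"\(.*?\)", expr)       # stops at the first ")" it can
negated = re.findall(r"\([^)]*\)", expr)  # character class excludes ")"

print(greedy)   # ['(1 + 2) * (3 + 4)']
print(lazy)     # ['(1 + 2)', '(3 + 4)']
print(negated)  # ['(1 + 2)', '(3 + 4)']
```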