Recitation 2
Homework Tips
- 20 minutes reading documentation will save you 8 hours of writing code
- Be familiar with the What Can I Ask On Diderot? policy
- Talk to other students taking the course – they can help you and you can help them.
- Look for the “Common Problems in Homework x” post on Diderot before asking questions online.
TA Hours
- Come this week! Don’t wait until the last week.
- We will only help you with your code if you construct a minimal counter-example: the simplest test case that still fails.
import pandas as pd
import sqlite3
import gzip
import scipy.sparse as sp
from random import shuffle
Views and Data
Pandas loads data and presents you with a view of it. Most read-only operations on a DataFrame produce a new view while leaving the underlying data intact; this lets you work quickly (copying is expensive) with very large datasets (big enough that you can only keep one copy in memory).
Here’s an example using a modified version of the Crimes in Boston dataset (CC0):
with gzip.open("crime_min.csv.gz", "rt", encoding="UTF-8") as crime_file:
crime = pd.read_csv(crime_file)
crime["date"] = pd.to_datetime(crime['date'], infer_datetime_format=True)
with gzip.open("offense_codes.csv.gz", "rt", encoding="UTF-8") as offense_file:
offense = pd.read_csv(offense_file).drop_duplicates("code")
crime
 | incident_num | offense_code | district | date | lat | lon
---|---|---|---|---|---|---
0 | I182070945 | 619 | D14 | 2018-09-02 13:00:00 | 42.357791 | -71.139371 |
1 | I182070943 | 1402 | C11 | 2018-08-21 00:00:00 | 42.306821 | -71.060300 |
2 | I182070941 | 3410 | D4 | 2018-09-03 19:27:00 | 42.346589 | -71.072429 |
3 | I182070940 | 3114 | D4 | 2018-09-03 21:16:00 | 42.334182 | -71.078664 |
4 | I182070938 | 3114 | B3 | 2018-09-03 21:05:00 | 42.275365 | -71.090361 |
... | ... | ... | ... | ... | ... | ... |
319068 | I050310906-00 | 3125 | D4 | 2016-06-05 17:25:00 | 42.336951 | -71.085748 |
319069 | I030217815-08 | 111 | E18 | 2015-07-09 13:38:00 | 42.255926 | -71.123172 |
319070 | I030217815-08 | 3125 | E18 | 2015-07-09 13:38:00 | 42.255926 | -71.123172 |
319071 | I010370257-00 | 3125 | E13 | 2016-05-31 19:35:00 | 42.302333 | -71.111565 |
319072 | 142052550 | 3125 | D4 | 2015-06-22 00:12:00 | 42.333839 | -71.080290 |
319073 rows × 6 columns
It uses a decent amount of memory:
crime.memory_usage(deep=True)
str(sum(crime.memory_usage(deep=True)) // 1024 // 1024) + " MB"
This doesn’t seem like much, but datasets get very large very quickly. Let’s select all reports of “GATHERING CAUSING ANNOYANCE”:
gca = crime[crime.offense_code == 3302]
gca
 | incident_num | offense_code | district | date | lat | lon
---|---|---|---|---|---|---
24331 | I182044697 | 3302 | D4 | 2018-06-09 14:36:00 | 42.351251 | -71.073052 |
37259 | I182030938 | 3302 | D4 | 2018-04-25 12:54:00 | 42.341318 | -71.078784 |
56534 | I182010380 | 3302 | A1 | 2018-02-08 17:20:00 | 42.355407 | -71.063124 |
62970 | I182003526 | 3302 | E5 | 2018-01-13 23:31:00 | 42.286228 | -71.124498 |
63220 | I182003288 | 3302 | B2 | 2018-01-12 23:55:00 | 42.333220 | -71.109439 |
150827 | I172017387 | 3302 | A1 | 2017-03-04 11:13:00 | 42.353110 | -71.064323 |
160521 | I172007037 | 3302 | A1 | 2017-01-26 14:59:00 | NaN | NaN |
165072 | I172002246 | 3302 | A1 | 2017-01-09 12:04:00 | 42.356502 | -71.062000 |
177200 | I162095504 | 3302 | C11 | 2016-11-22 05:20:00 | 42.304815 | -71.072183 |
186443 | I162085653 | 3302 | A1 | 2016-10-19 12:18:00 | NaN | NaN |
222475 | I162046897 | 3302 | B2 | 2016-06-13 23:56:00 | 42.327541 | -71.099500 |
227620 | I162041405 | 3302 | B2 | 2016-05-27 02:37:00 | 42.332590 | -71.100314 |
247973 | I162019520 | 3302 | C11 | 2016-03-13 00:20:00 | 42.295072 | -71.047497 |
251477 | I162015700 | 3302 | B2 | 2016-02-27 23:12:00 | 42.329751 | -71.098977 |
270919 | I152102472 | 3302 | B2 | 2015-12-12 00:33:00 | 42.330570 | -71.099591 |
271448 | I152101891 | 3302 | B2 | 2015-12-09 23:38:00 | 42.331286 | -71.102540 |
282045 | I152090122 | 3302 | B2 | 2015-10-30 22:01:00 | 42.334278 | -71.102952 |
285434 | I152086419 | 3302 | C11 | 2015-10-17 16:14:00 | 42.303565 | -71.078681 |
285585 | I152086244 | 3302 | B2 | 2015-10-17 01:10:00 | 42.329645 | -71.097472 |
287262 | I152084444 | 3302 | B2 | 2015-10-11 00:33:00 | 42.337683 | -71.096668 |
291103 | I152080302 | 3302 | B2 | 2015-09-26 17:45:00 | 42.327016 | -71.105551 |
291260 | I152080126 | 3302 | B2 | 2015-09-26 02:05:00 | 42.331513 | -71.104949 |
291262 | I152080124 | 3302 | B2 | 2015-09-26 02:59:00 | 42.334097 | -71.102264 |
293338 | I152077876 | 3302 | B2 | 2015-09-19 00:25:00 | 42.331348 | -71.103225 |
293411 | I152077794 | 3302 | E5 | 2015-09-18 18:01:00 | 42.285370 | -71.172440 |
295087 | I152075994 | 3302 | B2 | 2015-09-13 00:19:00 | 42.331836 | -71.104329 |
310780 | I152058711 | 3302 | C6 | 2015-07-16 06:52:00 | 42.353208 | -71.046471 |
We see that the created frame is a copy: its _is_copy attribute holds a weak reference to the original DataFrame:
gca._is_copy
Printing the index shows that it refers to the selected rows by their row labels in the original DataFrame:
gca.index
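As a quick sanity check (a sketch, assuming the frames above are still in scope), those labels pull the same rows back out of the original frame:
# The labels in gca.index address rows of crime directly; this should print True.
print(crime.loc[gca.index].equals(gca))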
Making a copy creates a new backing array, independent of the original data:
gca_copy = gca.copy()
print(gca_copy._is_copy)
None
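One way to see the view/copy distinction concretely is np.shares_memory. This is a sketch, assuming numpy is imported as np; the exact result depends on your pandas version and copy-on-write settings:
import numpy as np
head = crime.iloc[:100]  # a plain row slice usually shares memory with the original
print(np.shares_memory(head["lat"].to_numpy(), crime["lat"].to_numpy()))  # typically True
print(np.shares_memory(gca["lat"].to_numpy(), crime["lat"].to_numpy()))   # False: gca is a copy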
Databases and Indices
pandas and RDBMS software work off of indices. These exist to speed up queries.
The reason indexes are so important is data normalization: because they let us efficiently connect information dispersed over several tables, we can store data in a way that minimizes redundancy without giving up query speed. Here's a basic introduction to the concept.
speed_crime = crime.copy()
speed_crime.index = pd.DatetimeIndex(speed_crime.date)
str(sum(speed_crime.memory_usage(deep=True)) // 1024 // 1024) + " MB"
%%timeit
entries = []
fr, to = pd.to_datetime("2016-11-01"), pd.to_datetime("2016-12-01")
for idx, row in speed_crime.iterrows():
    if (idx > fr) and (idx < to):
        entries.append(row)
%%timeit
speed_crime.loc["2016-11-01":"2016-12-01"]
2.52 ms ± 32.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
It’s a little quicker! This is why you’re expected to complete hw2_time_series without looping over the data.
The chief gains here come from vectorization (pushing as much of the work as possible into low-level compiled loops over whole arrays) and then from indexing (preparing a lookup data structure to find parts of the data without scanning all of it).
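For comparison, even without the DatetimeIndex you can drop the Python loop entirely with a vectorized boolean mask over the date column; a minimal sketch:
fr, to = pd.to_datetime("2016-11-01"), pd.to_datetime("2016-12-01")
# One vectorized comparison per bound; no Python-level iteration over rows.
november = crime[(crime["date"] >= fr) & (crime["date"] < to)]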
Join operation
The use of an index allows us to join two tables together on common column values. Let's take a look at the offense table:
offense[offense.code.isin([3302, 3116, 1103, 3122, 2401])]
 | code | name
---|---|---
6 | 1103 | CONFIDENCE GAMES |
135 | 2401 | AFFRAY |
261 | 3116 | HARBOR INCIDENT / VIOLATION |
266 | 3122 | AIRCRAFT INCIDENTS |
299 | 3302 | GATHERING CAUSING ANNOYANCE |
Here we set the code column as the index:
offense_indexed = offense.set_index("code")
offense_indexed
code | name
---|---
1001 | COUNTERFEITING |
1002 | FORGERY OR UTTERING |
1101 | PASSING WORTHLESS CHECK |
1102 | FRAUD - FALSE PRETENSE |
1103 | CONFIDENCE GAMES |
... | ... |
905 | ARSON - OTHER COMMERCIAL |
906 | ARSON - COMMUNITY/PUB.STRUC. |
907 | ARSON - ALL OTHER STRUCTURES |
920 | ARSON - MOTOR VEHICLES |
930 | ARSON - OTHER |
425 rows × 1 columns
Now we can join the two tables, selecting the offense.name where crime.offense_code == offense.code:
crime.join(offense_indexed, on="offense_code", how="left")
 | incident_num | offense_code | district | date | lat | lon | name
---|---|---|---|---|---|---|---
0 | I182070945 | 619 | D14 | 2018-09-02 13:00:00 | 42.357791 | -71.139371 | LARCENY ALL OTHERS |
1 | I182070943 | 1402 | C11 | 2018-08-21 00:00:00 | 42.306821 | -71.060300 | VANDALISM |
2 | I182070941 | 3410 | D4 | 2018-09-03 19:27:00 | 42.346589 | -71.072429 | TOWED MOTOR VEHICLE |
3 | I182070940 | 3114 | D4 | 2018-09-03 21:16:00 | 42.334182 | -71.078664 | INVESTIGATE PROPERTY |
4 | I182070938 | 3114 | B3 | 2018-09-03 21:05:00 | 42.275365 | -71.090361 | INVESTIGATE PROPERTY |
... | ... | ... | ... | ... | ... | ... | ... |
319068 | I050310906-00 | 3125 | D4 | 2016-06-05 17:25:00 | 42.336951 | -71.085748 | WARRANT ARREST |
319069 | I030217815-08 | 111 | E18 | 2015-07-09 13:38:00 | 42.255926 | -71.123172 | MURDER, NON-NEGLIGIENT MANSLAUGHTER |
319070 | I030217815-08 | 3125 | E18 | 2015-07-09 13:38:00 | 42.255926 | -71.123172 | WARRANT ARREST |
319071 | I010370257-00 | 3125 | E13 | 2016-05-31 19:35:00 | 42.302333 | -71.111565 | WARRANT ARREST |
319072 | 142052550 | 3125 | D4 | 2015-06-22 00:12:00 | 42.333839 | -71.080290 | WARRANT ARREST |
319073 rows × 7 columns
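The same result can also be expressed with merge, which matches on columns instead of requiring an index on the right-hand table; a sketch:
# Equivalent to the join above, but matching column-to-column.
merged = crime.merge(offense, left_on="offense_code", right_on="code", how="left")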
Once More With SQLite
conn = sqlite3.connect(':memory:')
crime.reset_index().to_sql("crime", conn, if_exists="replace")
offense.reset_index().to_sql("offense", conn, if_exists="replace")
conn.execute("PRAGMA automatic_index = false;") # Disable automatic indexing for this demo
conn.commit()
conn.execute("SELECT * FROM `crime`;").fetchmany(2)
conn.execute("SELECT * FROM `offense`;").fetchmany(2)
SQLite Indexes and Join
We can set SQLite indexes up pretty easily. Let's first try the join query from earlier without one:
join_query = "SELECT * FROM `crime` LEFT JOIN `offense` ON crime.`offense_code` = offense.`code`;"
conn.execute(join_query).fetchmany(2)
It works, but how does it happen? We can use the EXPLAIN QUERY PLAN instruction to get SQLite to tell us its strategy.
conn.execute(f"EXPLAIN QUERY PLAN {join_query}").fetchall()
SCAN TABLE means SQLite searches the table by looping over every row, the slowest way of doing this. Here it means SQLite has to loop over the offense table once for each entry in crime.
Now let's create an index and try this again:
conn.execute("CREATE INDEX code_index ON offense(`code`)")
conn.execute(f"EXPLAIN QUERY PLAN {join_query}").fetchall()
SEARCH TABLE means the index code_index is used to locate entries inside offense. Instead of scanning, SQLite can jump straight to the matching offense rows for each entry in crime.
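To see the difference rather than just read the plan, we can time the join both ways. A sketch (timings will vary by machine; time_join is a hypothetical helper defined here):
import time

def time_join():
    # Run the join once and report wall-clock seconds.
    t0 = time.perf_counter()
    conn.execute(join_query).fetchall()
    return time.perf_counter() - t0

conn.execute("DROP INDEX code_index;")  # back to the unindexed state
print(f"without index: {time_join():.3f} s")
conn.execute("CREATE INDEX code_index ON offense(`code`);")
print(f"with index:    {time_join():.3f} s")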
Matrices and Sparse Representation
There’s a basic problem with matrices: an $n \times m$ matrix of type float64 takes $n \times m \times 8$ bytes of memory. That adds up to a lot of space, which is why we need sparse representations.
Different Representations
There are two different sparse representations we’re going to use.
The first is the Coordinate (or i, j, v) format, so called because we store the matrix as (i, j) coordinate pairs together with their values v.
i = [0, 1, 2, 3, 4, 5, 6]
j = [6, 1, 4, 3, 2, 0, 5]
v = [1, 2, 3, 4, 5, 6, 7]
m = sp.coo_matrix((v, (i, j)), shape=(7,7))
Let's convert the matrix to dense form to see it:
print(str(m.todense()).replace("0", " "))
[[            1]
 [  2          ]
 [        3    ]
 [      4      ]
 [    5        ]
 [6            ]
 [          7  ]]
We can create matrices that are very large. Built densely, the $20,000 \times 20,000$ matrix of float64 below would take $20,000 \times 20,000 \times 8$ bytes, about 3.2 GB of memory, to create:
size = 20000
i = list(range(size)) * 200  # 200 entries per row on average: 4M stored values, ~1% density
j = list(range(size)) * 200
v = list(range(1, size + 1)) * 200
shuffle(j)  # scatter the column positions
m = sp.coo_matrix((v, (i, j)), shape=(size, size))
m
str((m.col.nbytes + m.row.nbytes + m.data.nbytes) // 1024 // 1024) + " MB"
The most important operation with matrices is matrix multiplication. If we try to use this format to perform matrix multiplication, we’ll have to repeatedly search over the entire list. We need a more efficient way to store the sparse values.
That’s where the Compressed Sparse Row representation comes in:
md = m.tocsr()
md
str((md.data.nbytes + md.indptr.nbytes + md.indices.nbytes) // 1024 // 1024) + " MB"
The Compressed Sparse Row representation stores the entries row by row: all the (column index, value) pairs of a row sit contiguously, and a separate pointer array records where each row starts. This allows for much quicker math when the size and density of the matrix are sufficiently large.
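To make the layout concrete, here's a sketch of the three CSR arrays for the small 7×7 example from before (rebuilt here, since i, j, and v were reused above):
small = sp.coo_matrix(([1, 2, 3, 4, 5, 6, 7],
                       ([0, 1, 2, 3, 4, 5, 6], [6, 1, 4, 3, 2, 0, 5])),
                      shape=(7, 7)).tocsr()
print(small.indptr)   # row k's entries live in slots indptr[k]:indptr[k+1]
print(small.indices)  # column index of each stored value
print(small.data)     # the stored values themselves
Now let's try multiplication on our matrix of about 1% density: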
%%timeit
m * m
8.97 s ± 25.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
md * md
2.76 s ± 6.74 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)