Recitation 2
Homework Tips
- 20 minutes reading documentation will save you 8 hours of writing code
- Be familiar with the What Can I Ask On Diderot? policy
- Talk to other students taking the course – they can help you and you can help them.
- Look for the “Common Problems in Homework x” post on Diderot before asking questions online.
TA Hours
- Come this week! Don’t wait until the last week.
- We will only help you with your code if you construct a minimal counter-example: the simplest test case that still fails.
import pandas as pd
import sqlite3
import gzip
import scipy.sparse as sp
from random import shuffle
Views and Data
Pandas loads data and presents you with a view of it. Most read-only operations on a DataFrame produce a new view while leaving the underlying data intact; this lets you work quickly (copying is expensive) with very large datasets (big enough that you can only keep one copy in memory).
Here’s an example using a modified version of the Crimes in Boston dataset (CC0):
with gzip.open("crime_min.csv.gz", "rt", encoding="UTF-8") as crime_file:
crime = pd.read_csv(crime_file)
crime["date"] = pd.to_datetime(crime['date'], infer_datetime_format=True)
with gzip.open("offense_codes.csv.gz", "rt", encoding="UTF-8") as offense_file:
offense = pd.read_csv(offense_file).drop_duplicates("code")
crime
 | incident_num | offense_code | district | date | lat | lon
---|---|---|---|---|---|---
0 | I182070945 | 619 | D14 | 2018-09-02 13:00:00 | 42.357791 | -71.139371 |
1 | I182070943 | 1402 | C11 | 2018-08-21 00:00:00 | 42.306821 | -71.060300 |
2 | I182070941 | 3410 | D4 | 2018-09-03 19:27:00 | 42.346589 | -71.072429 |
3 | I182070940 | 3114 | D4 | 2018-09-03 21:16:00 | 42.334182 | -71.078664 |
4 | I182070938 | 3114 | B3 | 2018-09-03 21:05:00 | 42.275365 | -71.090361 |
... | ... | ... | ... | ... | ... | ... |
319068 | I050310906-00 | 3125 | D4 | 2016-06-05 17:25:00 | 42.336951 | -71.085748 |
319069 | I030217815-08 | 111 | E18 | 2015-07-09 13:38:00 | 42.255926 | -71.123172 |
319070 | I030217815-08 | 3125 | E18 | 2015-07-09 13:38:00 | 42.255926 | -71.123172 |
319071 | I010370257-00 | 3125 | E13 | 2016-05-31 19:35:00 | 42.302333 | -71.111565 |
319072 | 142052550 | 3125 | D4 | 2015-06-22 00:12:00 | 42.333839 | -71.080290 |
319073 rows × 6 columns
It uses a decent amount of memory:
crime.memory_usage(deep=True)
str(sum(crime.memory_usage(deep=True)) // 1024 // 1024) + " MB"
This doesn’t seem like much, but datasets get very large very quickly. Let’s select all reports of “GATHERING CAUSING ANNOYANCE”:
gca = crime[crime.offense_code == 3302]
gca
 | incident_num | offense_code | district | date | lat | lon
---|---|---|---|---|---|---
24331 | I182044697 | 3302 | D4 | 2018-06-09 14:36:00 | 42.351251 | -71.073052 |
37259 | I182030938 | 3302 | D4 | 2018-04-25 12:54:00 | 42.341318 | -71.078784 |
56534 | I182010380 | 3302 | A1 | 2018-02-08 17:20:00 | 42.355407 | -71.063124 |
62970 | I182003526 | 3302 | E5 | 2018-01-13 23:31:00 | 42.286228 | -71.124498 |
63220 | I182003288 | 3302 | B2 | 2018-01-12 23:55:00 | 42.333220 | -71.109439 |
150827 | I172017387 | 3302 | A1 | 2017-03-04 11:13:00 | 42.353110 | -71.064323 |
160521 | I172007037 | 3302 | A1 | 2017-01-26 14:59:00 | NaN | NaN |
165072 | I172002246 | 3302 | A1 | 2017-01-09 12:04:00 | 42.356502 | -71.062000 |
177200 | I162095504 | 3302 | C11 | 2016-11-22 05:20:00 | 42.304815 | -71.072183 |
186443 | I162085653 | 3302 | A1 | 2016-10-19 12:18:00 | NaN | NaN |
222475 | I162046897 | 3302 | B2 | 2016-06-13 23:56:00 | 42.327541 | -71.099500 |
227620 | I162041405 | 3302 | B2 | 2016-05-27 02:37:00 | 42.332590 | -71.100314 |
247973 | I162019520 | 3302 | C11 | 2016-03-13 00:20:00 | 42.295072 | -71.047497 |
251477 | I162015700 | 3302 | B2 | 2016-02-27 23:12:00 | 42.329751 | -71.098977 |
270919 | I152102472 | 3302 | B2 | 2015-12-12 00:33:00 | 42.330570 | -71.099591 |
271448 | I152101891 | 3302 | B2 | 2015-12-09 23:38:00 | 42.331286 | -71.102540 |
282045 | I152090122 | 3302 | B2 | 2015-10-30 22:01:00 | 42.334278 | -71.102952 |
285434 | I152086419 | 3302 | C11 | 2015-10-17 16:14:00 | 42.303565 | -71.078681 |
285585 | I152086244 | 3302 | B2 | 2015-10-17 01:10:00 | 42.329645 | -71.097472 |
287262 | I152084444 | 3302 | B2 | 2015-10-11 00:33:00 | 42.337683 | -71.096668 |
291103 | I152080302 | 3302 | B2 | 2015-09-26 17:45:00 | 42.327016 | -71.105551 |
291260 | I152080126 | 3302 | B2 | 2015-09-26 02:05:00 | 42.331513 | -71.104949 |
291262 | I152080124 | 3302 | B2 | 2015-09-26 02:59:00 | 42.334097 | -71.102264 |
293338 | I152077876 | 3302 | B2 | 2015-09-19 00:25:00 | 42.331348 | -71.103225 |
293411 | I152077794 | 3302 | E5 | 2015-09-18 18:01:00 | 42.285370 | -71.172440 |
295087 | I152075994 | 3302 | B2 | 2015-09-13 00:19:00 | 42.331836 | -71.104329 |
310780 | I152058711 | 3302 | C6 | 2015-07-16 06:52:00 | 42.353208 | -71.046471 |
We see that the created frame is a copy: its _is_copy attribute holds a weak reference to the original DataFrame:
gca._is_copy
Printing the index shows that it refers to the selected rows by their row labels in the original DataFrame:
gca.index
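As a quick sanity check (a sketch, assuming the frames above are still in scope), those labels pull the same rows back out of the original frame:
# The labels in gca.index address rows of crime directly; this should print True.
print(crime.loc[gca.index].equals(gca))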
Making a copy creates a new backing array, independent of the original data:
gca_copy = gca.copy()
print(gca_copy._is_copy)
None
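One way to see the view/copy distinction concretely is np.shares_memory. This is a sketch, assuming numpy is imported as np; the exact result depends on your pandas version and copy-on-write settings:
import numpy as np
head = crime.iloc[:100]  # a plain row slice usually shares memory with the original
print(np.shares_memory(head["lat"].to_numpy(), crime["lat"].to_numpy()))  # typically True
print(np.shares_memory(gca["lat"].to_numpy(), crime["lat"].to_numpy()))   # False: gca is a copy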
Databases and Indices
pandas and RDBMS software work off of indices. These exist to speed up queries.
The reason indexes are so important is data normalization: because they let us efficiently connect information dispersed over several tables, we can store data in a way that minimizes redundancy without giving up query speed. Here's a basic introduction to the concept.
speed_crime = crime.copy()
speed_crime.index = pd.DatetimeIndex(speed_crime.date)
str(sum(speed_crime.memory_usage(deep=True)) // 1024 // 1024) + " MB"
%%timeit
entries = []
fr, to = pd.to_datetime("2016-11-01"), pd.to_datetime("2016-12-01")
for idx, row in speed_crime.iterrows():
    if (idx > fr) and (idx < to):
        entries.append(row)
%%timeit
speed_crime.loc["2016-11-01":"2016-12-01"]
2.52 ms ± 32.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
It’s a little quicker! This is why you’re expected to complete hw2_time_series without looping over the data.
The chief gains here come from vectorization (pushing as much of the work as possible into low-level compiled loops over whole arrays) and then from indexing (preparing a lookup data structure to find parts of the data without scanning all of it).
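For comparison, even without the DatetimeIndex you can drop the Python loop entirely with a vectorized boolean mask over the date column; a minimal sketch:
fr, to = pd.to_datetime("2016-11-01"), pd.to_datetime("2016-12-01")
# One vectorized comparison per bound; no Python-level iteration over rows.
november = crime[(crime["date"] >= fr) & (crime["date"] < to)]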
Join operation
The use of an index allows us to join two tables together on common column values. Let's take a look at the offense table:
offense[offense.code.isin([3302, 3116, 1103, 3122, 2401])]
 | code | name
---|---|---
6 | 1103 | CONFIDENCE GAMES |
135 | 2401 | AFFRAY |
261 | 3116 | HARBOR INCIDENT / VIOLATION |
266 | 3122 | AIRCRAFT INCIDENTS |
299 | 3302 | GATHERING CAUSING ANNOYANCE |
Here we set the code column as the index:
offense_indexed = offense.set_index("code")
offense_indexed
code | name
---|---
1001 | COUNTERFEITING |
1002 | FORGERY OR UTTERING |
1101 | PASSING WORTHLESS CHECK |
1102 | FRAUD - FALSE PRETENSE |
1103 | CONFIDENCE GAMES |
... | ... |
905 | ARSON - OTHER COMMERCIAL |
906 | ARSON - COMMUNITY/PUB.STRUC. |
907 | ARSON - ALL OTHER STRUCTURES |
920 | ARSON - MOTOR VEHICLES |
930 | ARSON - OTHER |
425 rows × 1 columns
Now we can join the two tables, selecting the offense.name where crime.offense_code == offense.code:
crime.join(offense_indexed, on="offense_code", how="left")
 | incident_num | offense_code | district | date | lat | lon | name
---|---|---|---|---|---|---|---
0 | I182070945 | 619 | D14 | 2018-09-02 13:00:00 | 42.357791 | -71.139371 | LARCENY ALL OTHERS |
1 | I182070943 | 1402 | C11 | 2018-08-21 00:00:00 | 42.306821 | -71.060300 | VANDALISM |
2 | I182070941 | 3410 | D4 | 2018-09-03 19:27:00 | 42.346589 | -71.072429 | TOWED MOTOR VEHICLE |
3 | I182070940 | 3114 | D4 | 2018-09-03 21:16:00 | 42.334182 | -71.078664 | INVESTIGATE PROPERTY |
4 | I182070938 | 3114 | B3 | 2018-09-03 21:05:00 | 42.275365 | -71.090361 | INVESTIGATE PROPERTY |
... | ... | ... | ... | ... | ... | ... | ... |
319068 | I050310906-00 | 3125 | D4 | 2016-06-05 17:25:00 | 42.336951 | -71.085748 | WARRANT ARREST |
319069 | I030217815-08 | 111 | E18 | 2015-07-09 13:38:00 | 42.255926 | -71.123172 | MURDER, NON-NEGLIGIENT MANSLAUGHTER |
319070 | I030217815-08 | 3125 | E18 | 2015-07-09 13:38:00 | 42.255926 | -71.123172 | WARRANT ARREST |
319071 | I010370257-00 | 3125 | E13 | 2016-05-31 19:35:00 | 42.302333 | -71.111565 | WARRANT ARREST |
319072 | 142052550 | 3125 | D4 | 2015-06-22 00:12:00 | 42.333839 | -71.080290 | WARRANT ARREST |
319073 rows × 7 columns
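The same result can also be expressed with merge, which matches on columns instead of requiring an index on the right-hand table; a sketch:
# Equivalent to the join above, but matching column-to-column.
merged = crime.merge(offense, left_on="offense_code", right_on="code", how="left")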
Once More With SQLite
conn = sqlite3.connect(':memory:')
crime.reset_index().to_sql("crime", conn, if_exists="replace")
offense.reset_index().to_sql("offense", conn, if_exists="replace")
conn.execute("PRAGMA automatic_index = false;") # Disable automatic indexing for this demo
conn.commit()
conn.execute("SELECT * FROM `crime`;").fetchmany(2)
conn.execute("SELECT * FROM `offense`;").fetchmany(2)
SQLite Indexes and Join
We can set SQLite indexes up pretty easily. Let's first try the join query from earlier without one:
join_query = "SELECT * FROM `crime` LEFT JOIN `offense` ON crime.`offense_code` = offense.`code`;"
conn.execute(join_query).fetchmany(2)
It works, but how does it happen? We can use the EXPLAIN QUERY PLAN instruction to get SQLite to tell us its strategy.
conn.execute(f"EXPLAIN QUERY PLAN {join_query}").fetchall()
SCAN TABLE means SQLite searches the table by looping over every row, the slowest way of doing this. Here it means SQLite has to loop over the offense table once for each entry in crime.
Now let's create an index and try this again:
conn.execute("CREATE INDEX code_index ON offense(`code`)")
conn.execute(f"EXPLAIN QUERY PLAN {join_query}").fetchall()
SEARCH TABLE means the index code_index is used to locate entries inside offense. Instead of scanning, SQLite can jump straight to the matching offense rows for each entry in crime.
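To see the difference rather than just read the plan, we can time the join both ways. A sketch (timings will vary by machine; time_join is a hypothetical helper defined here):
import time

def time_join():
    # Run the join once and report wall-clock seconds.
    t0 = time.perf_counter()
    conn.execute(join_query).fetchall()
    return time.perf_counter() - t0

conn.execute("DROP INDEX code_index;")  # back to the unindexed state
print(f"without index: {time_join():.3f} s")
conn.execute("CREATE INDEX code_index ON offense(`code`);")
print(f"with index:    {time_join():.3f} s")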
Matrices and Sparse Representation
There’s a basic problem with matrices: an $n \times m$ matrix of type float64 takes $n \times m \times 8$ bytes of memory. That adds up to a lot of space, which is why we need sparse representations.
Different Representations
There are two different sparse representations we’re going to use.
The first is the Coordinate (or i, j, v) format, so called because we store the matrix as (i, j) coordinate pairs together with their values v.
i = [0, 1, 2, 3, 4, 5, 6]
j = [6, 1, 4, 3, 2, 0, 5]
v = [1, 2, 3, 4, 5, 6, 7]
m = sp.coo_matrix((v, (i, j)), shape=(7,7))
Let's convert the matrix to dense form to see it:
print(str(m.todense()).replace("0", " "))
[[            1]
 [  2          ]
 [        3    ]
 [      4      ]
 [    5        ]
 [6            ]
 [          7  ]]
We can create matrices that are very large. Built densely, the $20,000 \times 20,000$ matrix of float64 below would take $20,000 \times 20,000 \times 8$ bytes, about 3.2 GB of memory, to create:
size = 20000
i = list(range(size)) * 200  # 200 entries per row on average: 4M stored values, ~1% density
j = list(range(size)) * 200
v = list(range(1, size + 1)) * 200
shuffle(j)  # scatter the column positions
m = sp.coo_matrix((v, (i, j)), shape=(size, size))
m
str((m.col.nbytes + m.row.nbytes + m.data.nbytes) // 1024 // 1024) + " MB"
The most important operation with matrices is matrix multiplication. If we try to use this format to perform matrix multiplication, we’ll have to repeatedly search over the entire list. We need a more efficient way to store the sparse values.
That’s where the Compressed Sparse Row representation comes in:
md = m.tocsr()
md
str((md.data.nbytes + md.indptr.nbytes + md.indices.nbytes) // 1024 // 1024) + " MB"
The Compressed Sparse Row representation stores the entries row by row: all the (column index, value) pairs of a row sit contiguously, and a separate pointer array records where each row starts. This allows for much quicker math when the size and density of the matrix are sufficiently large.
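To make the layout concrete, here's a sketch of the three CSR arrays for the small 7×7 example from before (rebuilt here, since i, j, and v were reused above):
small = sp.coo_matrix(([1, 2, 3, 4, 5, 6, 7],
                       ([0, 1, 2, 3, 4, 5, 6], [6, 1, 4, 3, 2, 0, 5])),
                      shape=(7, 7)).tocsr()
print(small.indptr)   # row k's entries live in slots indptr[k]:indptr[k+1]
print(small.indices)  # column index of each stored value
print(small.data)     # the stored values themselves
Now let's try multiplication on our matrix of about 1% density: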
%%timeit
m * m
8.97 s ± 25.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
md * md
2.76 s ± 6.74 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)