
Recitation 2


import pandas as pd
import sqlite3
import gzip
import scipy.sparse as sp
from random import shuffle

Views and Data

Pandas loads data and presents you with a view of it. Most read-only operations on the DataFrame change the view but leave the underlying data intact; this allows you to work quickly (because copying is expensive) with very large datasets (big enough that you can only keep one copy in memory).
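
As a quick illustration of the difference between a view and a copy: selecting a column typically shares its backing array with the original frame, while .copy() allocates a new one. (A minimal sketch; the exact sharing behavior depends on your pandas version, especially once copy-on-write is enabled.)

import numpy as np

df = pd.DataFrame({"a": range(5)})
col = df["a"]              # a view onto df's data
print(np.shares_memory(col.to_numpy(), df["a"].to_numpy()))       # True: same buffer
col_copy = df["a"].copy()  # an independent copy
print(np.shares_memory(col_copy.to_numpy(), df["a"].to_numpy()))  # False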

Here’s an example using a modified version of the Crimes in Boston dataset (CC0):

with gzip.open("crime_min.csv.gz", "rt", encoding="UTF-8") as crime_file:
    crime = pd.read_csv(crime_file)
    crime["date"] = pd.to_datetime(crime['date'], infer_datetime_format=True)

with gzip.open("offense_codes.csv.gz", "rt", encoding="UTF-8") as offense_file:
    offense = pd.read_csv(offense_file).drop_duplicates("code")
crime
incident_num offense_code district date lat lon
0 I182070945 619 D14 2018-09-02 13:00:00 42.357791 -71.139371
1 I182070943 1402 C11 2018-08-21 00:00:00 42.306821 -71.060300
2 I182070941 3410 D4 2018-09-03 19:27:00 42.346589 -71.072429
3 I182070940 3114 D4 2018-09-03 21:16:00 42.334182 -71.078664
4 I182070938 3114 B3 2018-09-03 21:05:00 42.275365 -71.090361
... ... ... ... ... ... ...
319068 I050310906-00 3125 D4 2016-06-05 17:25:00 42.336951 -71.085748
319069 I030217815-08 111 E18 2015-07-09 13:38:00 42.255926 -71.123172
319070 I030217815-08 3125 E18 2015-07-09 13:38:00 42.255926 -71.123172
319071 I010370257-00 3125 E13 2016-05-31 19:35:00 42.302333 -71.111565
319072 142052550 3125 D4 2015-06-22 00:12:00 42.333839 -71.080290

319073 rows × 6 columns

It uses a decent amount of memory:

crime.memory_usage(deep=True)
str(sum(crime.memory_usage(deep=True)) // 1024 // 1024) + " MB"

This doesn’t seem like much, but datasets get very large very quickly. Let’s select all reports of “GATHERING CAUSING ANNOYANCE”:

gca = crime[crime.offense_code == 3302]
gca
incident_num offense_code district date lat lon
24331 I182044697 3302 D4 2018-06-09 14:36:00 42.351251 -71.073052
37259 I182030938 3302 D4 2018-04-25 12:54:00 42.341318 -71.078784
56534 I182010380 3302 A1 2018-02-08 17:20:00 42.355407 -71.063124
62970 I182003526 3302 E5 2018-01-13 23:31:00 42.286228 -71.124498
63220 I182003288 3302 B2 2018-01-12 23:55:00 42.333220 -71.109439
150827 I172017387 3302 A1 2017-03-04 11:13:00 42.353110 -71.064323
160521 I172007037 3302 A1 2017-01-26 14:59:00 NaN NaN
165072 I172002246 3302 A1 2017-01-09 12:04:00 42.356502 -71.062000
177200 I162095504 3302 C11 2016-11-22 05:20:00 42.304815 -71.072183
186443 I162085653 3302 A1 2016-10-19 12:18:00 NaN NaN
222475 I162046897 3302 B2 2016-06-13 23:56:00 42.327541 -71.099500
227620 I162041405 3302 B2 2016-05-27 02:37:00 42.332590 -71.100314
247973 I162019520 3302 C11 2016-03-13 00:20:00 42.295072 -71.047497
251477 I162015700 3302 B2 2016-02-27 23:12:00 42.329751 -71.098977
270919 I152102472 3302 B2 2015-12-12 00:33:00 42.330570 -71.099591
271448 I152101891 3302 B2 2015-12-09 23:38:00 42.331286 -71.102540
282045 I152090122 3302 B2 2015-10-30 22:01:00 42.334278 -71.102952
285434 I152086419 3302 C11 2015-10-17 16:14:00 42.303565 -71.078681
285585 I152086244 3302 B2 2015-10-17 01:10:00 42.329645 -71.097472
287262 I152084444 3302 B2 2015-10-11 00:33:00 42.337683 -71.096668
291103 I152080302 3302 B2 2015-09-26 17:45:00 42.327016 -71.105551
291260 I152080126 3302 B2 2015-09-26 02:05:00 42.331513 -71.104949
291262 I152080124 3302 B2 2015-09-26 02:59:00 42.334097 -71.102264
293338 I152077876 3302 B2 2015-09-19 00:25:00 42.331348 -71.103225
293411 I152077794 3302 E5 2015-09-18 18:01:00 42.285370 -71.172440
295087 I152075994 3302 B2 2015-09-13 00:19:00 42.331836 -71.104329
310780 I152058711 3302 C6 2015-07-16 06:52:00 42.353208 -71.046471

The created frame is flagged as a potential copy: its _is_copy attribute holds a weak reference to the original DataFrame it was selected from:

gca._is_copy

Printing out the index, we see that it refers to the selected rows by their row labels in the original DataFrame:

gca.index
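
Because those original labels are preserved, label-based and position-based access differ; a quick sketch using the first row shown above:

gca.loc[24331]   # by label: the row whose index in crime was 24331
gca.iloc[0]      # by position: the first row of gca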

Making a copy creates a new backing array, independent of the original data:

gca_copy = gca.copy()
print(gca_copy._is_copy)
None

Databases and Indices

Both pandas and RDBMS software are built around indices, lookup structures that exist to speed up queries.

Indexes are so important largely because of data normalization: by letting us efficiently connect information dispersed over several tables, they allow us to store data in a way that minimizes redundancy and maximizes throughput. Here’s a basic introduction to the concept.
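
Here’s a toy version of the idea (a hypothetical illustration, not part of the dataset): instead of repeating the offense name on every incident, we store it once in a lookup table and reconnect the tables through the shared code column.

denormalized = pd.DataFrame({
    "incident": ["A", "B", "C"],
    "offense_code": [3302, 3302, 1103],
    "offense_name": ["GATHERING CAUSING ANNOYANCE",
                     "GATHERING CAUSING ANNOYANCE",
                     "CONFIDENCE GAMES"],
})
incidents = denormalized[["incident", "offense_code"]]                    # fact table
names = denormalized[["offense_code", "offense_name"]].drop_duplicates()  # lookup table
incidents.merge(names, on="offense_code")  # recovers the original view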

speed_crime = crime.copy()
speed_crime.index = pd.DatetimeIndex(speed_crime.date)
str(sum(speed_crime.memory_usage(deep=True)) // 1024 // 1024) + " MB"
%%timeit
entries = []
fr, to = pd.to_datetime("2016-11-01"), pd.to_datetime("2016-12-01")
for idx, row in speed_crime.iterrows():
    if (idx > fr) and (idx < to):
        entries.append(row)
%%timeit
speed_crime.loc["2016-11-01":"2016-12-01"]
2.52 ms ± 32.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

It’s a little quicker! This is why you’re expected to complete hw2_time_series without looping over the data.

The chief gains here come from vectorization (executing as much of the work as possible in low-level compiled code instead of Python loops), and then from indexing (preparing a lookup data structure so that parts of the data can be found without scanning everything).
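
For comparison, the loop’s filter can be written as a vectorized boolean mask over the index, with no Python-level iteration (a sketch equivalent to the .loc lookup above):

fr, to = pd.to_datetime("2016-11-01"), pd.to_datetime("2016-12-01")
speed_crime[(speed_crime.index > fr) & (speed_crime.index < to)]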

Join operation

The use of an index allows us to join two tables together using common column values. Let’s take a look at the offense table:

offense[offense.code.isin([3302, 3116, 1103, 3122, 2401])]
code name
6 1103 CONFIDENCE GAMES
135 2401 AFFRAY
261 3116 HARBOR INCIDENT / VIOLATION
266 3122 AIRCRAFT INCIDENTS
299 3302 GATHERING CAUSING ANNOYANCE

To join on it, we set the code column as the index:

offense_indexed = offense.set_index("code")
offense_indexed
name
code
1001 COUNTERFEITING
1002 FORGERY OR UTTERING
1101 PASSING WORTHLESS CHECK
1102 FRAUD - FALSE PRETENSE
1103 CONFIDENCE GAMES
... ...
905 ARSON - OTHER COMMERCIAL
906 ARSON - COMMUNITY/PUB.STRUC.
907 ARSON - ALL OTHER STRUCTURES
920 ARSON - MOTOR VEHICLES
930 ARSON - OTHER

425 rows × 1 columns

Now we can join the two tables, selecting the offense.name where crime.offense_code == offense.code:

crime.join(offense_indexed, on="offense_code", how="left")
incident_num offense_code district date lat lon name
0 I182070945 619 D14 2018-09-02 13:00:00 42.357791 -71.139371 LARCENY ALL OTHERS
1 I182070943 1402 C11 2018-08-21 00:00:00 42.306821 -71.060300 VANDALISM
2 I182070941 3410 D4 2018-09-03 19:27:00 42.346589 -71.072429 TOWED MOTOR VEHICLE
3 I182070940 3114 D4 2018-09-03 21:16:00 42.334182 -71.078664 INVESTIGATE PROPERTY
4 I182070938 3114 B3 2018-09-03 21:05:00 42.275365 -71.090361 INVESTIGATE PROPERTY
... ... ... ... ... ... ... ...
319068 I050310906-00 3125 D4 2016-06-05 17:25:00 42.336951 -71.085748 WARRANT ARREST
319069 I030217815-08 111 E18 2015-07-09 13:38:00 42.255926 -71.123172 MURDER, NON-NEGLIGIENT MANSLAUGHTER
319070 I030217815-08 3125 E18 2015-07-09 13:38:00 42.255926 -71.123172 WARRANT ARREST
319071 I010370257-00 3125 E13 2016-05-31 19:35:00 42.302333 -71.111565 WARRANT ARREST
319072 142052550 3125 D4 2015-06-22 00:12:00 42.333839 -71.080290 WARRANT ARREST

319073 rows × 7 columns
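
For reference, the same result can be produced with pd.merge, which joins on columns directly without setting an index first (an equivalent formulation, not the one used above):

crime.merge(offense, left_on="offense_code", right_on="code", how="left")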

Once More With SQLite

conn = sqlite3.connect(':memory:')
crime.reset_index().to_sql("crime", conn, if_exists="replace")
offense.reset_index().to_sql("offense", conn, if_exists="replace")
conn.execute("PRAGMA automatic_index = false;") # Disable automatic indexing for this demo
conn.commit()
conn.execute("SELECT * FROM `crime`;").fetchmany(2)
conn.execute("SELECT * FROM `offense`;").fetchmany(2)

SQLite Indexes and Join

We can set up SQLite indexes pretty easily. Let’s first try the join query without an index:

join_query = "SELECT * FROM `crime` LEFT JOIN `offense` ON crime.`offense_code` = offense.`code`;"
conn.execute(join_query).fetchmany(2)

It works, but how does it happen? We can use the EXPLAIN QUERY PLAN statement to get SQLite to tell us its strategy.

conn.execute(f"EXPLAIN QUERY PLAN {join_query}").fetchall()

SCAN TABLE means SQLite searches the table by looping over every row, the slowest way of doing this. With no index available, SQLite has to loop over the entire offense table once for each entry in crime, a nested-loop join.
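
Spelled out in Python, that plan amounts to nested loops like these (a conceptual sketch of the work involved, capped at 1,000 crime rows so it runs quickly; SQLite of course does this in C):

offenses = conn.execute("SELECT code, name FROM offense").fetchall()
joined = []
for (code,) in conn.execute("SELECT offense_code FROM crime LIMIT 1000"):
    for o_code, o_name in offenses:  # re-scan offense for every crime row
        if code == o_code:
            joined.append((code, o_name))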

Now let’s create an index and do this again:

conn.execute(f"CREATE INDEX code_index ON offense(`code`)")
conn.execute(f"EXPLAIN QUERY PLAN {join_query}").fetchall()

SEARCH TABLE means that the index code_index is used to locate entries inside offense efficiently. Instead of rescanning offense for every entry in crime, SQLite now jumps straight to the matching offense row through the index.
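
The Python analogue of what the index buys us is a dictionary keyed on code (again a conceptual sketch, not SQLite’s actual B-tree):

by_code = dict(conn.execute("SELECT code, name FROM offense").fetchall())
joined = [(code, by_code.get(code))
          for (code,) in conn.execute("SELECT offense_code FROM crime LIMIT 1000")]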

Matrices and Sparse Representation

There’s a basic problem with matrices. An $n \times m$ matrix of type float64 takes $n\times m\times 8$ bytes of memory. That adds up to a lot of space.

\[\begin{pmatrix} 1 & 2 & 0 & 0 & 0 & 0 \\ 0 & 3 & 0 & 4 & 0 & 0 \\ 0 & 0 & 5 & 6 & 7 & 0 \\ 0 & 0 & 0 & 0 & 0 & 8 \\ \end{pmatrix}\]

Most of the entries in a matrix like the one above are zero, yet the dense representation spends eight bytes on every one of them. That’s why we need sparse representations.
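
Even at modest scale the dense cost is steep. For the $20{,}000 \times 20{,}000$ matrix we build below:

\[20{,}000 \times 20{,}000 \times 8 \text{ bytes} = 3.2 \times 10^9 \text{ bytes} \approx 3\ \text{GB},\]

almost all of it spent on zeros.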

Different Representations

There are two different sparse representations we’re going to use.

The first is the Coordinate (COO), or i, j, v, format. It’s called that because we store the matrix as triples: a row coordinate i, a column coordinate j, and a value v.

i = [0, 1, 2, 3, 4, 5, 6]
j = [6, 1, 4, 3, 2, 0, 5]
v = [1, 2, 3, 4, 5, 6, 7]

m = sp.coo_matrix((v, (i, j)), shape=(7,7))

Let’s convert the matrix to its dense form to see it:

print(str(m.todense()).replace("0", " "))
[[            1]
 [  2          ]
 [        3    ]
 [      4      ]
 [    5        ]
 [6            ]
 [          7  ]]

We can create matrices that are very large. As the arithmetic above showed, the dense form of the $20{,}000 \times 20{,}000$ matrix we build next would take about 3.2 GB of memory; in sparse form we only pay for its four million stored entries.

size=20000
i = list(range(size)) * 200
j = list(range(size)) * 200
v = list(range(1, size + 1)) * 200
shuffle(j)
m = sp.coo_matrix((v, (i, j)), shape=(size, size))
m
str((m.col.nbytes + m.row.nbytes + m.data.nbytes) // 1024 // 1024) + " MB"

The most important operation with matrices is matrix multiplication. If we try to perform it in this format, there is no quick way to locate the entries of a given row or column; we would have to repeatedly search over the entire list of triples. We need a more efficient way to store the sparse values.
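
To see the problem concretely, here’s what extracting a single row from a COO matrix costs (a sketch: every lookup has to scan all of the stored triples):

def coo_row(coo, r):
    # no row structure to exploit -- scan every (i, j, v) triple
    return [(c, val) for i, c, val in zip(coo.row, coo.col, coo.data) if i == r]

coo_row(m, 0)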

That’s where the Compressed Sparse Row representation comes in:

md = m.tocsr()
md
str((md.data.nbytes + md.indptr.nbytes + md.indices.nbytes) // 1024 // 1024) + " MB"

The Compressed Sparse Row representation stores the nonzero values row by row in a single data array, with each value’s column index in a parallel indices array (usually kept sorted within a row) and an indptr array recording where each row starts. This allows for much quicker math when the size and density of the matrix are sufficiently large.
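
The three arrays are easy to inspect on a small example (a quick sketch; the exact integer dtypes vary by platform):

import numpy as np

small = sp.csr_matrix(np.array([[1, 0, 2],
                                [0, 0, 3],
                                [4, 5, 0]]))
print(small.data)     # nonzero values, row by row: [1 2 3 4 5]
print(small.indices)  # column index of each value: [0 2 2 0 1]
print(small.indptr)   # offsets where each row starts: [0 2 3 5]

Here we try multiplication on our matrix of about 1% density: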

%%timeit
m * m
8.97 s ± 25.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
md * md
2.76 s ± 6.74 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)