The best of both worlds - python and q interpreters in one process; with an iPython notebook frontend.

This is just a quick example using pyq; check out the full article by Himanshu Gupta.

Both tools are great for data analysis, but if performance is a concern you may find the numbers at the bottom interesting.

Vehicle collision data courtesy of NYC Open Data.

In [1]:
import pandas as pd
from pyq import q

In [2]:
%load_ext pyq.magic

%%q

For running q code in the interpreter - can either use the q(...) function or the ipython magic %%q. q variables (K types) are available to python via the 'q' namespace, e.g. q.var1

In [3]:
%%q
a:("DTSIFF*IIIIIIIII***";enlist ",")0:`:NYPD_Motor_Vehicle_Collisions.csv // Read in csv file
a:(cols .Q.id a) xcol a  // remove the spaces in the column names
3#a

Out[3]:
DATE       TIME         BOROUGH  ZIPCODE LATITUDE LONGITUDE LOCATION          ..
------------------------------------------------------------------------------..
2015.07.04 17:50:00.000 QUEENS   11413   40.65954 -73.75413 "(40.6595418, -73...
2015.07.04 17:46:00.000 BROOKLYN 11224   40.5825  -73.97455 "(40.582503, -73.9..
2015.07.04 17:45:00.000 BROOKLYN 11211   40.71469 -73.94269 "(40.7146912, -73...

In [4]:
%%q
count a  // largish

Out[4]:
616546

In [5]:
type (q.a)  # native K types, as per k.h

Out[5]:
pyq.K

In [6]:
df = pd.read_csv("NYPD_Motor_Vehicle_Collisions.csv",parse_dates=[['DATE', 'TIME']])  # python's turn to read file 

In [7]:
df.columns = [c.replace(' ', '') for c in df.columns]  # remove the spaces in column names

In [8]:
df[:3]

Out[8]:
DATE_TIME BOROUGH ZIPCODE LATITUDE LONGITUDE LOCATION ONSTREETNAME CROSSSTREETNAME OFFSTREETNAME NUMBEROFPERSONSINJURED ... CONTRIBUTINGFACTORVEHICLE2 CONTRIBUTINGFACTORVEHICLE3 CONTRIBUTINGFACTORVEHICLE4 CONTRIBUTINGFACTORVEHICLE5 UNIQUEKEY VEHICLETYPECODE1 VEHICLETYPECODE2 VEHICLETYPECODE3 VEHICLETYPECODE4 VEHICLETYPECODE5
0 2015-07-04 17:50:00 QUEENS 11413 40.659542 -73.754128 (40.6595418, -73.7541276) 146 AVENUE 227 STREET NaN 0 ... Unspecified NaN NaN NaN 3252504 SPORT UTILITY / STATION WAGON UNKNOWN NaN NaN NaN
1 2015-07-04 17:46:00 BROOKLYN 11224 40.582503 -73.974554 (40.582503, -73.9745544) SHELL ROAD WEST 6 STREET NaN 0 ... Unspecified NaN NaN NaN 3252868 OTHER UNKNOWN NaN NaN NaN
2 2015-07-04 17:45:00 BROOKLYN 11211 40.714691 -73.942688 (40.7146912, -73.9426881) METROPOLITAN AVENUE HUMBOLDT STREET NaN 1 ... Unspecified NaN NaN NaN 3252475 PASSENGER VEHICLE BICYCLE NaN NaN NaN

3 rows × 28 columns

Timing the q aggregation:

In [9]:
%timeit -n 100 q('grp:select sum NUMBEROFPERSONSINJURED,sum NUMBEROFPERSONSKILLED by DATE.year from a')

100 loops, best of 3: 27 ms per loop

In [10]:
q.grp

Out[10]:
year| NUMBEROFPERSONSINJURED NUMBEROFPERSONSKILLED
----| --------------------------------------------
2012| 27445                  137                  
2013| 55112                  297                  
2014| 51198                  262                  
2015| 24011                  120

Timing the pandas aggregation

7-8x slower w/ pandas:

In [11]:
%timeit -n 100 grp=df[['NUMBEROFPERSONSINJURED','NUMBEROFPERSONSKILLED']].groupby(df.DATE_TIME.dt.year).sum()

100 loops, best of 3: 195 ms per loop

In [12]:
df[['NUMBEROFPERSONSINJURED','NUMBEROFPERSONSKILLED']].groupby(df.DATE_TIME.dt.year).sum()

Out[12]:
NUMBEROFPERSONSINJURED NUMBEROFPERSONSKILLED
2012 27445 137
2013 55112 297
2014 51198 262
2015 24011 120

Comments

comments powered by Disqus