Pandas

Pandas is a Python package that provides data structures and routines for data analysis. Its popularity has made it the de facto library for analytics in Python.

The following tutorial is based on the lesson found at http://www.gregreda.com/2013/10/26/intro-to-pandas-data-structures/

In [2]:
# we begin by importing the required packages
import pyodbc as pydb
import pandas as pd
import numpy  as np
import matplotlib.pyplot as plt
pd.set_option('display.max_columns', 50)
%matplotlib inline

Series

We begin by looking at the Series: a one-dimensional ndarray-like structure with axis labels, similar to a vector or list.

SYNTAX:
pandas.Series(data[, index, dtype, copy])
data := a list, dictionary, ndarray, or simply a scalar
index := labels, which must be hashable (uniqueness is not enforced). A zero-based integer index is assigned if none is given.
dtype := the data type. Inferred from the data if omitted.
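As a quick illustration of the scalar case mentioned above (a minimal sketch; the labels are made up), a scalar is broadcast across the supplied index:

# a scalar is repeated for every index label
pd.Series(5.0, index=['a', 'b', 'c'])
# a    5.0
# b    5.0
# c    5.0
# dtype: float64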

Constructing

empty_series = pd.Series()
s = pd.Series([7, 'Heisenberg', 3.14, -1789710578, 'Happy Eating!'])

# You can also assign an index as follows
s = pd.Series([7, 'Heisenberg', 3.14, -1789710578, 'Happy Eating!'],
              index=['A', 'Z', 'C', 'Y', 'E'])

# If you use a dictionary the keys will form the index
d = {'Chicago': 1000, 'New York': 1300, 'Portland': 900, 
     'San Francisco': 1100,'Austin': 450, 'Boston': None}
cities = pd.Series(d)
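Inspecting the result shows that the None becomes NaN and the dtype is inferred as float64 (note: older pandas versions sorted the dictionary keys, so the ordering may differ):

print(cities)
# Chicago          1000.0
# New York         1300.0
# Portland          900.0
# San Francisco    1100.0
# Austin            450.0
# Boston              NaN
# dtype: float64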

Selecting

cities[0]                         # By position (deprecated in newer pandas; see the .loc/.iloc sketch below)
cities['New York']                # By index label
cities['New York','Portland']     # fails! (the tuple is treated as a single label)
cities[['New York','Portland']]   # Select multiple labels with a list
cities[cities <= 900]             # By boolean condition
cities[:2]                        # first 2 elements
print('Toronto' in cities)        # Existence check
cities % 3                        # Math ops using a scalar are applied elementwise

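pandas also provides explicit label-based and position-based accessors, .loc and .iloc; a brief sketch using the cities Series from above:

cities.loc['New York']            # label-based access
cities.iloc[0]                    # position-based access
cities.loc[['Austin', 'Boston']]  # multiple labels at once
cities.iloc[:2]                   # first 2 elements by position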

Updating & Filtering

cities['San Francisco'] = 999     # to change a value
cities[cities <= 900] = 111       # multiple updates

# Adding two Series aligns them on the union of their index labels;
# only labels present in both get a numeric result, the rest become NaN
print(cities[['Chicago', 'New York', 'Portland']] + cities[['Austin', 'New York']])
# Only New York gets a numeric value because it is the only label present in both Series

# isnull() and notnull() are used to filter out nulls
print(cities[cities.isnull()])
print(cities[cities.notnull()])
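Closely related (a short aside, not part of the original lesson): dropna() and fillna() remove or replace the missing values instead of just masking them; neither modifies the Series in place.

cities.dropna()      # drops Boston, whose value is NaN
cities.fillna(0)     # returns a copy with NaN replaced by 0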

Series are useful, but the real power of pandas lies in the DataFrame, so we move on to that.

DataFrames

A DataFrame is similar to an R data.frame or a table in a relational database: a two-dimensional structure with labelled rows and columns, where each column can hold a different type.

Simple construction

data = {'year': [2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012],
        'team': ['Bears', 'Bears', 'Bears', 'Packers', 'Packers', 'Lions', 'Lions', 'Lions'],
        'wins': [11, 8, 10, 15, 11, 6, 10, 4],
        'losses': [5, 8, 6, 1, 5, 10, 6, 12]}
football = pd.DataFrame(data, columns=['year', 'team', 'wins', 'losses'])
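As a quick taste of what a DataFrame gives us (a sketch using the football frame we just built), a single column comes back as a Series and rows can be filtered with boolean conditions:

football['team']                    # a single column is a Series
football[football['wins'] > 10]     # boolean row filtering
football[['team', 'wins']].head(3)  # a subset of columns, first 3 rows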

File construction

In [17]:
# More often than not you'll be reading CSVs or other data into a dataframe
from_csv = pd.read_csv('mariano-rivera.csv')
from_csv.head()
Out[17]:
Year Age Tm Lg W L W-L% ERA G GS GF CG SHO SV IP H R ER HR BB IBB SO HBP BK WP BF ERA+ WHIP H/9 HR/9 BB/9 SO/9 SO/BB Awards
0 1995 25 NYY AL 5 3 0.625 5.51 19 10 2 0 0 0 67.0 71 43 41 11 30 0 51 2 1 0 301 84 1.507 9.5 1.5 4.0 6.9 1.70 NaN
1 1996 26 NYY AL 8 3 0.727 2.09 61 0 14 0 0 5 107.2 73 25 25 1 34 3 130 2 0 1 425 240 0.994 6.1 0.1 2.8 10.9 3.82 CYA-3MVP-12
2 1997 27 NYY AL 6 4 0.600 1.88 66 0 56 0 0 43 71.2 65 17 15 5 20 6 68 0 0 2 301 239 1.186 8.2 0.6 2.5 8.5 3.40 ASMVP-25
3 1998 28 NYY AL 3 0 1.000 1.91 54 0 49 0 0 36 61.1 48 13 13 3 17 1 36 1 0 0 246 233 1.060 7.0 0.4 2.5 5.3 2.12 NaN
4 1999 29 NYY AL 4 3 0.571 1.83 66 0 63 0 0 45 69.0 43 15 14 2 18 3 52 3 1 2 268 257 0.884 5.6 0.3 2.3 6.8 2.89 ASCYA-3MVP-14
In [18]:
# Here's a fuller illustration
#The column names we want to assign
cols = ['num', 'game', 'date', 'team', 'home_away', 'opponent', 'result',
        'quarter', 'distance', 'receiver', 'score_before', 'score_after']
no_headers = pd.read_csv('peyton-passing-TDs-2012.csv', 
                         sep=',', 
                         header=None,
                         names=cols)
no_headers.head()
Out[18]:
num game date team home_away opponent result quarter distance receiver score_before score_after
0 1 1 2012-09-09 DEN NaN PIT W 31-19 3 71 Demaryius Thomas Trail 7-13 Lead 14-13*
1 2 1 2012-09-09 DEN NaN PIT W 31-19 4 1 Jacob Tamme Trail 14-19 Lead 22-19*
2 3 2 2012-09-17 DEN @ ATL L 21-27 2 17 Demaryius Thomas Trail 0-20 Trail 7-20
3 4 3 2012-09-23 DEN NaN HOU L 25-31 4 38 Brandon Stokley Trail 11-31 Trail 18-31
4 5 3 2012-09-23 DEN NaN HOU L 25-31 4 6 Joel Dreessen Trail 18-31 Trail 25-31

There are a lot of options, such as skipping lines or parsing dates; the documentation is your friend here.
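For example (the filename and column name here are hypothetical, just to show a few of the options):

df = pd.read_csv('some_file.csv',
                 skiprows=2,             # skip the first 2 lines of the file
                 parse_dates=['date'],   # parse this column as datetimes
                 na_values=['-'])        # treat '-' as missing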

And for every read function there's a corresponding write function as well.
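For instance, a minimal CSV round trip (the filename is just illustrative):

football.to_csv('football.csv', index=False)   # write the DataFrame out
pd.read_csv('football.csv').head()             # and read it back in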

Next we will write to an Excel file (note: reading an Excel file back in requires an engine such as xlrd or openpyxl).

In [19]:
football.head()
football.to_excel('football.xlsx',index=False)
In [20]:
#Now we'll delete the dataframe
del football
#And read it back in
soccer = pd.read_excel('football.xlsx', sheet_name='Sheet1')
soccer  #Voila
Out[20]:
year team wins losses
0 2010 Bears 11 5
1 2011 Bears 8 8
2 2012 Bears 10 6
3 2011 Packers 15 1
4 2012 Packers 11 5
5 2010 Lions 6 10
6 2011 Lions 10 6
7 2012 Lions 4 12
In [31]:
#Now we'll connect to a db and pull some data
cnxn = pydb.connect("Driver={SQL Server Native Client 11.0};"
                    r"Server=Someone\MZRSQLExpress;"
                    "Database=AdventureWorks2012;"
                    "Trusted_Connection=yes;")
cursor = cnxn.cursor()
cursor.execute('SELECT top 10 PersonType,LastName,MiddleName,FirstName FROM Person.Person')   
Out[31]:
<pyodbc.Cursor at 0x250064e1e70>
In [32]:
for row in cursor:
    print('row = %r' % (row,))
row = ('EM', 'Sánchez', 'J', 'Ken')
row = ('EM', 'Duffy', 'Lee', 'Terri')
row = ('EM', 'Tamburello', None, 'Roberto')
row = ('EM', 'Walters', None, 'Rob')
row = ('EM', 'Erickson', 'A', 'Gail')
row = ('EM', 'Goldberg', 'H', 'Jossef')
row = ('EM', 'Miller', 'A', 'Dylan')
row = ('EM', 'Margheim', 'L', 'Diane')
row = ('EM', 'Matthew', 'N', 'Gigi')
row = ('EM', 'Raheem', None, 'Michael')
In [37]:
#The pandas way is to read the results directly into a DataFrame
query = "SELECT top 10 PersonType,LastName,MiddleName,FirstName FROM Person.Person"
results = pd.read_sql(query, con=cnxn)
results.head()
Out[37]:
PersonType LastName MiddleName FirstName
0 EM Sánchez J Ken
1 EM Duffy Lee Terri
2 EM Tamburello None Roberto
3 EM Walters None Rob
4 EM Erickson A Gail
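read_sql also accepts query parameters, which it passes through to the underlying cursor; a small sketch against the same connection (pyodbc uses '?' placeholders, and the 'EM' filter is just an example):

query = "SELECT TOP 10 PersonType, LastName, FirstName FROM Person.Person WHERE PersonType = ?"
pd.read_sql(query, con=cnxn, params=['EM'])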
In [41]:
# Copying from a clipboard can be finicky
hank = pd.read_clipboard()
hank.head()
Out[41]:
Ticker Date shares $-Open Comm BookVal Date.1 Shares $-Close Comm .1
0 AMD 12/28/2016 4,180 $11.9600 $9.95 $49,982.85 1/12/2017 4,180 $10.7707 10.93
1 UGAZ 1/12/2017 1,630 $31.9000 $9.95 $51,987.05 1/12/2017 1,630 $31.0500 11.05
2 JNUG 1/17/2017 5,640 $9.1200 $- $51,436.80 1/18/2017 5,640 $8.9000 11.04
3 DSLV 1/18/2017 1,855 $27.0500 $9.95 $50,167.80 1/18/2017 1,855 $26.7500 11.03
4 JNUG 1/18/2017 5,840 $8.4297 $- $49,229.45 1/18/2017 5,840 $8.1311 10.99
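The dollar signs and thousands separators come in as text, so the numeric columns usually need a little cleanup; a sketch assuming the column names shown above:

hank['shares'] = pd.to_numeric(hank['shares'].str.replace(',', '', regex=False))
hank['$-Open'] = pd.to_numeric(hank['$-Open'].str.replace('$', '', regex=False))
hank['BookVal'] = pd.to_numeric(hank['BookVal'].str.replace('[$,]', '', regex=True))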