Home ¶

Pandas¶

Pandas is a python package providing various data structures and routines for the analysis of data. It's popularity has made it the defacto library for analytics

The following tutorial is based on the lesson found at http://www.gregreda.com/2013/10/26/intro-to-pandas-data-structures/

# we begin by importing the required packages
import pyodbc as pydb
import pandas as pd
import numpy  as np
import matplotlib.pyplot as plt
pd.set_option('max_columns',50)
%matplotlib inline

Series¶

We begin by looking at series: a one-dimensional ndarray, like a vector or list, with axis labels.

SYNTAX:
pandas.Series( data[, index, dtype, copy])
Data := can be a list, dictionary, array, or simply a scalar
Index := must be unique and hashable. Will be assigned a zero based numerical index if not given.
dtype := is the data type. Will be inferred if omitted

Constructing¶

empty_series = pd.Series()
s = pd.Series([7, 'Heisenberg', 3.14, -1789710578, 'Happy Eating!'])

# You can also assign an index as follows
s = pd.Series([7, 'Heisenberg', 3.14, -1789710578, 'Happy Eating!'],
              index=['A', 'Z', 'C', 'Y', 'E'])

# If you use a dictionary the keys will form the index
d = {'Chicago': 1000, 'New York': 1300, 'Portland': 900, 
     'San Francisco': 1100,'Austin': 450, 'Boston': None}
cities = pd.Series(d)

Selecting¶

cities[0]                         # By Index
cities['New York']                # By Index Name
cities['New York','Portland']     # fails!!!!
cities[['New York','Portland']]   # Select multiple indices
cities[cities <= 900]             # By condition
cities[:2]                        # first 2 elements
print('Toronto' in cities)        # Existence check
cities % 3                        # Math ops using a scalar are applied elementwise

# isnull() and notnull() are used to filter out nulls
print(cities[cities.isnull()])
print(cities[cities.notnull()])

Updating & Filtering¶

cities['San Francisco'] = 999     # to change a value
cities[cities <= 900] = 111       # multiple updates
print('Toronto' in cities)        # Existence check
cities % 3                        # Math ops using a scalar are applied elementwise

    # Addition of two Series results in a union of index values and an intersection of their values
print(cities[['Chicago', 'New York', 'Portland']] + cities[['Austin', 'New York']])
    #Only New York has a value because it is the only element in both of the Series

# isnull() and notnull() are used to filter out nulls
print(cities[cities.isnull()])
print(cities[cities.notnull()])

Series are not terribly exciting so we move on to dataframes which is where the real beauty lies in Pandas

Dataframes¶

Similar to R dataframes or even a table in most RDBMS

Simple construction¶

data = {'year': [2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012],
        'team': ['Bears', 'Bears', 'Bears', 'Packers', 'Packers', 'Lions', 'Lions', 'Lions'],
        'wins': [11, 8, 10, 15, 11, 6, 10, 4],
        'losses': [5, 8, 6, 1, 5, 10, 6, 12]}
football = pd.DataFrame(data, columns=['year', 'team', 'wins', 'losses'])

File construction¶

from_csv = pd.read_csv('mariano-rivera.csv')
from_csv.head()

# More often than not you'll be reading CSVs or other data into a dataframe
from_csv = pd.read_csv('mariano-rivera.csv')
from_csv.head()

# Here's a fuller illustration
#The column names we want to assign
cols = ['num', 'game', 'date', 'team', 'home_away', 'opponent','result'
        'quarter', 'distance', 'receiver', 'score_before','score_after']
no_headers = pd.read_csv('peyton-passing-TDs-2012.csv', 
                         sep=',', 
                         header=None,
                         names=cols)
no_headers.head()

There are a lot of options like skipping lines, parsing dates etc. The documentation is your friend here

And for every read function there's a write as well

Next we will write to an excel (Note: Reading from an excel requires the xlrd lib)

football.head()
football.to_excel('football.xlsx',index=False)

#Now we'll delete the dataframe
del football
#And read it back in
soccer = pd.read_excel('football.xlsx','Sheet1')
soccer  #Voila

#Now we'll connect to a db and pull some data
cnxn = pydb.connect("Driver={SQL Server Native Client 11.0};"
                      "Server=Someone\MZRSQLExpress;"
                      "Database=AdventureWorks2012;"
                      "Trusted_Connection=yes;")
cursor = cnxn.cursor()
cursor.execute('SELECT top 10 PersonType,LastName,MiddleName,FirstName FROM Person.Person')

<pyodbc.Cursor at 0x250064e1e70>

for row in cursor:
    print('row = %r' % (row,))

row = ('EM', 'Sánchez', 'J', 'Ken')
row = ('EM', 'Duffy', 'Lee', 'Terri')
row = ('EM', 'Tamburello', None, 'Roberto')
row = ('EM', 'Walters', None, 'Rob')
row = ('EM', 'Erickson', 'A', 'Gail')
row = ('EM', 'Goldberg', 'H', 'Jossef')
row = ('EM', 'Miller', 'A', 'Dylan')
row = ('EM', 'Margheim', 'L', 'Diane')
row = ('EM', 'Matthew', 'N', 'Gigi')
row = ('EM', 'Raheem', None, 'Michael')

#The panda way is to read it into a dataframe
query = "SELECT top 10 PersonType,LastName,MiddleName,FirstName FROM Person.Person"
results = pd.read_sql(query, con=cnxn)
results.head()

# Copying from a clipboard can be finicky
hank = pd.read_clipboard()
hank.head()

	Year	Age	Tm	Lg	W	L	W-L%	ERA	G	GS	GF	SV	IP	H	R	ER	HR	BB	IBB	SO	HBP	BK	WP	BF	ERA+	WHIP	H/9	HR/9	BB/9	SO/9	SO/BB	Awards
0	1995	25	NYY	AL	5	3	0.625	5.51	19	10	2	0	67.0	71	43	41	11	30	0	51	2	1	0	301	84	1.507	9.5	1.5	4.0	6.9	1.70	NaN
1	1996	26	NYY	AL	8	3	0.727	2.09	61	0	14	5	107.2	73	25	25	1	34	3	130	2	0	1	425	240	0.994	6.1	0.1	2.8	10.9	3.82	CYA-3MVP-12
2	1997	27	NYY	AL	6	4	0.600	1.88	66	0	56	43	71.2	65	17	15	5	20	6	68	0	0	2	301	239	1.186	8.2	0.6	2.5	8.5	3.40	ASMVP-25
3	1998	28	NYY	AL	3	0	1.000	1.91	54	0	49	36	61.1	48	13	13	3	17	1	36	1	0	0	246	233	1.060	7.0	0.4	2.5	5.3	2.12	NaN
4	1999	29	NYY	AL	4	3	0.571	1.83	66	0	63	45	69.0	43	15	14	2	18	3	52	3	1	2	268	257	0.884	5.6	0.3	2.3	6.8	2.89	ASCYA-3MVP-14

	num	game	date	team	home_away	opponent	resultquarter	distance	receiver	score_before	score_after
1	1	2012-09-09	DEN	NaN	PIT	W 31-19	3	71	Demaryius Thomas	Trail 7-13	Lead 14-13*
2	1	2012-09-09	DEN	NaN	PIT	W 31-19	4	1	Jacob Tamme	Trail 14-19	Lead 22-19*
3	2	2012-09-17	DEN	@	ATL	L 21-27	2	17	Demaryius Thomas	Trail 0-20	Trail 7-20
4	3	2012-09-23	DEN	NaN	HOU	L 25-31	4	38	Brandon Stokley	Trail 11-31	Trail 18-31
5	3	2012-09-23	DEN	NaN	HOU	L 25-31	4	6	Joel Dreessen	Trail 18-31	Trail 25-31

	Ticker	Date	shares	$-Open	Comm	BookVal	Date.1	Shares	$-Close	Comm .1
0	AMD	12/28/2016	4,180	$11.9600	$9.95	$49,982.85	1/12/2017	4,180	$10.7707	10.93
1	UGAZ	1/12/2017	1,630	$31.9000	$9.95	$51,987.05	1/12/2017	1,630	$31.0500	11.05
2	JNUG	1/17/2017	5,640	$9.1200	$-	$51,436.80	1/18/2017	5,640	$8.9000	11.04
3	DSLV	1/18/2017	1,855	$27.0500	$9.95	$50,167.80	1/18/2017	1,855	$26.7500	11.03
4	JNUG	1/18/2017	5,840	$8.4297	$-	$49,229.45	1/18/2017	5,840	$8.1311	10.99

	year	team	wins	losses
0	2010	Bears	11	5
1	2011	Bears	8	8
2	2012	Bears	10	6
3	2011	Packers	15	1
4	2012	Packers	11	5
5	2010	Lions	6	10
6	2011	Lions	10	6
7	2012	Lions	4	12

	PersonType	LastName	MiddleName	FirstName
0	EM	Sánchez	J	Ken
1	EM	Duffy	Lee	Terri
2	EM	Tamburello	None	Roberto
3	EM	Walters	None	Rob
4	EM	Erickson	A	Gail

Home¶