Pandas is a python package providing various data structures and routines for the analysis of data. It's popularity has made it the defacto library for analytics
The following tutorial is based on the lesson found at http://www.gregreda.com/2013/10/26/intro-to-pandas-data-structures/
# we begin by importing the required packages
import pyodbc as pydb
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
pd.set_option('max_columns',50)
%matplotlib inline
We begin by looking at series: a one-dimensional ndarray, like a vector or list, with axis labels.
SYNTAX:
pandas.Series( data[, index, dtype, copy])
Data := can be a list, dictionary, array, or simply a scalar
Index := must be unique and hashable. Will be assigned a zero based numerical index if not given.
dtype := is the data type. Will be inferred if omitted
empty_series = pd.Series()
s = pd.Series([7, 'Heisenberg', 3.14, -1789710578, 'Happy Eating!'])
# You can also assign an index as follows
s = pd.Series([7, 'Heisenberg', 3.14, -1789710578, 'Happy Eating!'],
index=['A', 'Z', 'C', 'Y', 'E'])
# If you use a dictionary the keys will form the index
d = {'Chicago': 1000, 'New York': 1300, 'Portland': 900,
'San Francisco': 1100,'Austin': 450, 'Boston': None}
cities = pd.Series(d)
cities[0] # By Index
cities['New York'] # By Index Name
cities['New York','Portland'] # fails!!!!
cities[['New York','Portland']] # Select multiple indices
cities[cities <= 900] # By condition
cities[:2] # first 2 elements
print('Toronto' in cities) # Existence check
cities % 3 # Math ops using a scalar are applied elementwise
# isnull() and notnull() are used to filter out nulls
print(cities[cities.isnull()])
print(cities[cities.notnull()])
cities['San Francisco'] = 999 # to change a value
cities[cities <= 900] = 111 # multiple updates
print('Toronto' in cities) # Existence check
cities % 3 # Math ops using a scalar are applied elementwise
# Addition of two Series results in a union of index values and an intersection of their values
print(cities[['Chicago', 'New York', 'Portland']] + cities[['Austin', 'New York']])
#Only New York has a value because it is the only element in both of the Series
# isnull() and notnull() are used to filter out nulls
print(cities[cities.isnull()])
print(cities[cities.notnull()])
Series are not terribly exciting so we move on to dataframes which is where the real beauty lies in Pandas
Similar to R dataframes or even a table in most RDBMS
data = {'year': [2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012],
'team': ['Bears', 'Bears', 'Bears', 'Packers', 'Packers', 'Lions', 'Lions', 'Lions'],
'wins': [11, 8, 10, 15, 11, 6, 10, 4],
'losses': [5, 8, 6, 1, 5, 10, 6, 12]}
football = pd.DataFrame(data, columns=['year', 'team', 'wins', 'losses'])
from_csv = pd.read_csv('mariano-rivera.csv')
from_csv.head()
# More often than not you'll be reading CSVs or other data into a dataframe
from_csv = pd.read_csv('mariano-rivera.csv')
from_csv.head()
# Here's a fuller illustration
#The column names we want to assign
cols = ['num', 'game', 'date', 'team', 'home_away', 'opponent','result'
'quarter', 'distance', 'receiver', 'score_before','score_after']
no_headers = pd.read_csv('peyton-passing-TDs-2012.csv',
sep=',',
header=None,
names=cols)
no_headers.head()
There are a lot of options like skipping lines, parsing dates etc. The documentation is your friend here
And for every read function there's a write as well
Next we will write to an excel (Note: Reading from an excel requires the xlrd lib)
football.head()
football.to_excel('football.xlsx',index=False)
#Now we'll delete the dataframe
del football
#And read it back in
soccer = pd.read_excel('football.xlsx','Sheet1')
soccer #Voila
#Now we'll connect to a db and pull some data
cnxn = pydb.connect("Driver={SQL Server Native Client 11.0};"
"Server=Someone\MZRSQLExpress;"
"Database=AdventureWorks2012;"
"Trusted_Connection=yes;")
cursor = cnxn.cursor()
cursor.execute('SELECT top 10 PersonType,LastName,MiddleName,FirstName FROM Person.Person')
for row in cursor:
print('row = %r' % (row,))
#The panda way is to read it into a dataframe
query = "SELECT top 10 PersonType,LastName,MiddleName,FirstName FROM Person.Person"
results = pd.read_sql(query, con=cnxn)
results.head()
# Copying from a clipboard can be finicky
hank = pd.read_clipboard()
hank.head()