Python Pandas, an Introduction

Python Pandas, for Data Analysis

Last updated August 10, 2022

The Pandas Data Analysis Library brings SQL-like sorting and querying to semi-structured data in Python. The examples below were shamelessly lifted from the book "Python for Data Analysis."

Installing Python Pandas:

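From the command line, install the Python package manager pip if you haven't done so yet (on current Debian and Ubuntu releases the package is python3-pip; older Python 2 systems used python-pip):

sudo apt-get install python3-pip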

Pandas requires NumPy. pip pulls it in automatically as a dependency of pandas, but you can also install both explicitly:

sudo pip install numpy
sudo pip install pandas

At the start of your Python program, import the necessary libraries:

from pandas import Series
from pandas import DataFrame
import pandas as pd
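
A quick way to confirm that the install worked is to import both libraries and print their versions. This is only a sanity check, not part of the book's examples, and the version numbers you see depend on your system.

```python
import numpy as np
import pandas as pd

# If either import fails, the corresponding package is not installed
# for the Python interpreter you are running.
print("pandas", pd.__version__)
print("numpy", np.__version__)
```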

Working with Arrays: Series

(You can run the code below from this file)

To know pandas you need to know all about series and data frames. Let's start with a series. A series is a one-dimensional array (or object) of data and an index. Pandas will let you create a series:

obj = Series([13, 23, 2, 15])

If no index is present, one will be created automatically. You can create a series and define the index:

obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj2['d'] = 6
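
As a minimal sketch of the two cases above, the snippet below looks at the index pandas generates on its own and at one supplied by hand. The names auto and labeled are made up for this illustration.

```python
import pandas as pd

# No index given: pandas generates the integer labels 0..3
auto = pd.Series([13, 23, 2, 15])
print(list(auto.index))      # [0, 1, 2, 3]

# Explicit string labels, in the order they were supplied
labeled = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
print(list(labeled.index))   # ['d', 'b', 'a', 'c']

# Values are looked up by label, not by position
print(labeled['a'])          # -5
```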

Use the index to assign a certain value:

obj2['a'] = 14

You can create a series from a Python dict. The keys become the index:

Dict2SeriesData = {'Monday': 2200, 'Tuesday': 3528, 'Wednesday': 123299, 'Thursday': 3234}
Dict2Series = Series(Dict2SeriesData)

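Put a series in a particular order by passing the index order you want (Note: Pandas will assign a NaN to any label it does not find in the data, such as 'Friday' below, and will leave out any label you do not list, such as 'Thursday'):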

Days = ['Wednesday', 'Friday', 'Monday', 'Tuesday']
SortedDays = Series(Dict2SeriesData, index=Days)
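
The NaN behavior is easy to check for yourself. The sketch below rebuilds the same data under the illustrative names sales, days and ordered, then asks pandas which labels came back empty.

```python
import pandas as pd

sales = {'Monday': 2200, 'Tuesday': 3528, 'Wednesday': 123299, 'Thursday': 3234}
days = ['Wednesday', 'Friday', 'Monday', 'Tuesday']

ordered = pd.Series(sales, index=days)

# 'Friday' has no entry in the dict, so its value is NaN
missing = ordered[ordered.isnull()]
print(list(missing.index))          # ['Friday']

# 'Thursday' was not requested, so it is left out
print('Thursday' in ordered.index)  # False

# A NaN forces the whole series to floating point
print(ordered.dtype)                # float64
```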

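You can add two series together. Pandas lines the values up by index label, and any label that appears in only one of the two series (here 'Thursday' and 'Sunday') gets a NaN in the result: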

Dict3SeriesData = {'Monday': 1400, 'Tuesday': 10000, 'Wednesday': 5, 'Sunday': 2365}
Dict3Series = Series(Dict3SeriesData)
Dailies = Dict3Series + Dict2Series
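
If you would rather treat a missing day as zero than end up with NaN, the add() method with fill_value is one option. The sketch below compares the two approaches; week_a and week_b are illustrative names for the same two sets of numbers.

```python
import pandas as pd

week_a = pd.Series({'Monday': 2200, 'Tuesday': 3528, 'Wednesday': 123299, 'Thursday': 3234})
week_b = pd.Series({'Monday': 1400, 'Tuesday': 10000, 'Wednesday': 5, 'Sunday': 2365})

# Plain addition: labels present in only one series become NaN
total = week_a + week_b
print(total['Monday'])         # 3600.0
print(total.isnull().sum())    # 2  (Sunday and Thursday)

# add() with fill_value=0 treats a missing label as 0 instead
filled = week_a.add(week_b, fill_value=0)
print(filled['Sunday'])        # 2365.0
print(filled['Thursday'])      # 3234.0
```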

Working with Arrays: Data Frames

A data frame is a two-dimensional labeled data structure that resembles a spreadsheet. Its columns can hold different data types, and it has an index for both the rows and the columns (Operational code samples for this section are available here).

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2001, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)
print(frame)
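
Once the frame exists, you can pull out a single column or filter the rows. The sketch below uses the same data; the names years and nevada are illustrative.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2001, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = pd.DataFrame(data)

print(frame.shape)                 # (5, 3): five rows, three columns
print(sorted(frame.columns))       # ['pop', 'state', 'year']

# A single column comes back as a Series that shares the row index
years = frame['year']
print(type(years).__name__)        # Series

# A boolean mask keeps only the rows where the condition holds
nevada = frame[frame['state'] == 'Nevada']
print(nevada['pop'].tolist())      # [2.4, 2.9]
```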

Reading in Data:

The next example requires users.dat, ratings.dat, and movies.dat. Run the code here.

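# Run these commands in IPython, or as a stand-alone Python program

import pandas as pd

# sep='::' is a multi-character separator, which needs the Python parsing engine
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table('users.dat', sep='::', header=None, names=unames, engine='python')

rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('ratings.dat', sep='::', header=None, names=rnames, engine='python')

mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('movies.dat', sep='::', header=None, names=mnames, engine='python')

users[:5]
movies[:5]
ratings

# merge() joins on the column names the tables share (user_id, then movie_id)
data = pd.merge(pd.merge(ratings, users), movies)

# Count the ratings per title (as in the book's example), then keep titles with at least 250
ratings_by_title = data.groupby('title').size()
active_titles = ratings_by_title.index[ratings_by_title >= 250]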
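
If you don't have the .dat files handy, you can try the same read, merge and filter steps on a few made-up rows. In the sketch below the sample rows are hypothetical and only imitate the '::'-delimited layout. io.StringIO stands in for the files, and the threshold is lowered from 250 to 2 so that the tiny sample returns something.

```python
import io
import pandas as pd

# Hypothetical sample rows; the real files are far larger
ratings_text = "1::10::5::978300760\n2::10::4::978300761\n3::20::3::978300762\n"
movies_text = "10::Toy Story (1995)::Animation\n20::Heat (1995)::Action\n"

ratings = pd.read_table(io.StringIO(ratings_text), sep='::', header=None,
                        names=['user_id', 'movie_id', 'rating', 'timestamp'],
                        engine='python')
movies = pd.read_table(io.StringIO(movies_text), sep='::', header=None,
                       names=['movie_id', 'title', 'genres'],
                       engine='python')

# merge() joins on the column name both tables share, here 'movie_id'
data = pd.merge(ratings, movies)

# Number of ratings each title received
ratings_by_title = data.groupby('title').size()

# Keep only the titles with at least 2 ratings (250 in the full example)
active_titles = ratings_by_title.index[ratings_by_title >= 2]
print(list(active_titles))   # ['Toy Story (1995)']
```

Because merge() matches on column names, the names passed to read_table have to be spelled identically in both tables, with no stray spaces.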