Scientific Computing

Scientific Computing is the application of computer programming to scientific applications: data analysis, simulation & modelling, plotting, etc.

Scientific Python: Scipy Stack

Scipy = Scientific Python

  • scipy

  • numpy

  • pandas

  • Data Analysis in Python

Scipy is an ecosystem, including a collection of open-source packages for scientific computing in Python.

A ‘family’ of packages that all work well together to do scientific computing.

Not made by the same people who manage the standard library.

Homogenous Data

  • for example: store data of the same type (i.e. all numerics)

  • recordings of values from experimental participants

  • heights or quantitative information from survey data

Lists are a start, and lists of lists are possible.

But, they can get nightmareish.

So we use arrays.

numpy

numpy - stands for numerical python

arrays - work with arrays (matrices)

Allow you to efficiently operate on arrays (linear algebra, matrix operations, etc.)

import numpy as np
# Create some arrays of data
arr0 = np.array([1, 2, 3])
arr1 = np.array([[1, 2], [3, 4]])
arr2 = np.array([[5, 6], [7, 8]])
arr1
array([[1, 2],
       [3, 4]])
# lists of lists don't store dimensionality well
[[1, 2], [3, 4]] 

Indexing Arrays

# Check out an array of data
arr1
# Check the shape of the array
arr1.shape
# Index into a numpy array
arr1[0, 0]

Working with N-dimensional (multidimensional) arrays is easy within numpy.

Notes on Arrays

# arrays are most helpful when they
# have the same length in each list
np.array([[1, 2, 3, 4], [2, 3, 4]])
# arrays are meant to store homogeneous data
np.array([[1, 2, 'cogs18'], [2, 3, 4]])

Working with Arrays

(Things you can’t do with lists)

# Add arrays together
arr1 + arr2
# Matrix mutliplication
arr1 * arr2

A brief aside: zip()

zip() takes two iterables (things you can loop over) and loop over them together.

for a, b in zip([1,2], ['a','b']):
    print(a, b)

Clicker Question #1

Given the following code, what will it print out?

data = np.array([[1, 2, 3, 4],
                 [5, 6, 7, 8]])
 
output = []
for d1, d2 in zip(data[0, :], data[1, :]):
    output.append(d1 + d2)

print(output)
  • A) [1, 2, 3, 4]

  • B) [1, 2, 3, 4, 5, 6, 7, 8]

  • C) [6, 8, 10, 12]

  • D) [10, 26]

  • E) [36]

Note that if you find yourself looping over arrays…there is probably a better way.

data.sum()
data.sum(axis=0)

Heterogenous Data

  • have continuous (numeric) and categorical (discrete) data

  • different data types need to be stored

  • uses a DataFrame object (think: spreadsheet)

  • allows for column and row labels

pandas

import pandas as pd
# Create some example heterogenous data
d1 = {'Subj_ID': '001', 'score': 16, 'group' : 2, 'condition': 'cognition'}
d2 = {'Subj_ID': '002', 'score': 22, 'group' : 1, 'condition': 'perception'}
# Create a dataframe 
df = pd.DataFrame([d1, d2], [0, 1])
# Check out the dataframe
df
Subj_ID condition group score
0 001 cognition 2 16
1 002 perception 1 22
# You can index in pandas
df['condition']
# You can index in pandas
df.loc[0,:]

Working with DataFrames

df.describe()
# Take the average of all numeric columns
df.mean()

Clicker Question #2

Comparing them to standard library Python types, which is the best mapping for these new data types?

  • A) DataFrames are like lists, arrays are like tuples

  • B) DataFrames and arrays are like lists

  • C) DataFrames are like tuples, arrays are like lists

  • D) DataFrames and arrays are like dictionaries

  • E) Dataframes are like dictionaries, arrays are like lists

Plotting

%matplotlib inline

import matplotlib.pyplot as plt
# Create some data
dat = np.array([1, 2, 4, 8, 16, 32])
# Plot the data
plt.plot(dat);
../_images/19-ScientificComputing_48_0.png
  • can change plot type

  • lots of customizations possible

Analysis

  • scipy - statistical analysis

  • sklearn - machine learning

import scipy as sp
from scipy import stats
# Simulate some data
d1 = stats.norm.rvs(loc=0, size=1000)
d2 = stats.norm.rvs(loc=0.5, size=1000)

Analysis - Plotting the Data

# Plot the data
plt.hist(d1, 25, alpha=0.6);
plt.hist(d2, 25, alpha=0.6);
../_images/19-ScientificComputing_54_0.png

Analysis - Statistical Comparisons

# Statistically compare the two distributions
stats.ttest_ind(d1, d2)
Ttest_indResult(statistic=-9.33809588164776, pvalue=2.5256641524949454e-20)

COGS108: Data Science in Practice

If you are interested in data science and scientific computing in Python, consider taking COGS108 : https://github.com/COGS108/.