Scientific Computing¶

Scientific Computing is the application of computer programming to scientific applications: data analysis, simulation & modelling, plotting, etc.

Scientific Python: Scipy Stack¶

Scipy = Scientific Python

scipy
numpy
pandas
Data Analysis in Python

Scipy is an ecosystem, including a collection of open-source packages for scientific computing in Python.

A ‘family’ of packages that all work well together to do scientific computing.

These packages are not part of the standard library and are not made by the same people who manage it.

Homogeneous Data¶

for example: store data of the same type (i.e. all numerics)
recordings of values from experimental participants
heights or quantitative information from survey data

Lists are a start, and lists of lists are possible.

But they can get nightmarish.

So we use arrays.

`numpy`¶

numpy - stands for numerical python

arrays - work with arrays (matrices)

Allow you to efficiently operate on arrays (linear algebra, matrix operations, etc.)

import numpy as np

# Create some arrays of data
arr0 = np.array([1, 2, 3])
arr1 = np.array([[1, 2], [3, 4]])
arr2 = np.array([[5, 6], [7, 8]])

arr1

array([[1, 2],
       [3, 4]])

# lists of lists don't store dimensionality well
[[1, 2], [3, 4]] 
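A minimal sketch of why nested lists get awkward, reusing the values of arr1 above. The names `nested` and `first_column` are illustrative and not from the lecture.

```python
import numpy as np

nested = [[1, 2], [3, 4]]
arr = np.array(nested)

# A nested list only knows the length of its outer list
outer_length = len(nested)

# Pulling a column out of nested lists needs an explicit loop
first_column = [row[0] for row in nested]

# An array records its full shape and supports column slicing directly
shape = arr.shape
first_column_arr = arr[:, 0]

print(outer_length, first_column)
print(shape, first_column_arr)
```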

Indexing Arrays¶

# Check out an array of data
arr1

# Check the shape of the array
arr1.shape

# Index into a numpy array
arr1[0, 0]

Working with N-dimensional (multidimensional) arrays is easy within numpy.
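As a rough illustration of the same indexing in more dimensions, the sketch below slices arr1 and then indexes a small 3-dimensional array. The `cube` array is a made-up example, not part of the lecture.

```python
import numpy as np

arr1 = np.array([[1, 2], [3, 4]])

# A colon selects everything along that dimension
first_row = arr1[0, :]
second_col = arr1[:, 1]

# Negative indices count from the end, as with lists
last_item = arr1[-1, -1]

# The same comma-separated indexing extends to more dimensions
cube = np.arange(24).reshape(2, 3, 4)   # 2 blocks, 3 rows, 4 columns
value = cube[1, 2, 3]                   # last block, last row, last column

print(first_row, second_col, last_item)
print(cube.shape, value)
```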

Notes on Arrays¶

# arrays are most helpful when they
# have the same length in each list;
# rows of unequal length like these raise a ValueError in NumPy 1.24 and later
np.array([[1, 2, 3, 4], [2, 3, 4]])

# arrays are meant to store homogeneous data
np.array([[1, 2, 'cogs18'], [2, 3, 4]])
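One way to see what happens in the two cases above is to inspect the `dtype` of the result. This is a sketch under the assumption of a reasonably recent NumPy; the variable names are illustrative.

```python
import numpy as np

# Mixing numbers and a string: every element is converted to a string
mixed = np.array([[1, 2, 'cogs18'], [2, 3, 4]])
print(mixed.dtype.kind)   # 'U' means unicode string
print(mixed[1, 0])        # the number 2 is now the string '2'

# Mixing integers and floats: every element becomes a float
floats = np.array([1, 2.5, 3])
print(floats.dtype.kind)  # 'f' means float

# Rows of unequal length can be stored by asking for dtype=object,
# but the result is a 1-dimensional array of lists, not a 2-D array
ragged = np.array([[1, 2, 3, 4], [2, 3, 4]], dtype=object)
print(ragged.shape)
```

Storing rows as objects gives up most of what makes arrays useful, so it is usually better to keep the data rectangular and of one type.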

Working with Arrays¶

(Things you can’t do with lists)

# Add arrays together
arr1 + arr2

# Element-wise multiplication (use arr1 @ arr2 for matrix multiplication)
arr1 * arr2
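To make the operators concrete, here is a short sketch with arr1 and arr2 from above. Note that `*` multiplies element by element, while `@` performs matrix multiplication. The list comparison at the end is added for contrast.

```python
import numpy as np

arr1 = np.array([[1, 2], [3, 4]])
arr2 = np.array([[5, 6], [7, 8]])

added = arr1 + arr2          # element-wise addition
elementwise = arr1 * arr2    # element-wise multiplication
matrix_product = arr1 @ arr2 # matrix multiplication

# With plain lists, + concatenates instead of adding element by element
list_sum = [1, 2] + [3, 4]

print(added)
print(elementwise)
print(matrix_product)
print(list_sum)
```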

A brief aside: `zip()`¶

zip() takes two or more iterables (things you can loop over) and loops over them together.

for a, b in zip([1,2], ['a','b']):
    print(a, b)
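A small sketch of two details worth knowing about zip(): it pairs items by position, and it stops at the end of the shortest iterable. The lists here are made up for illustration.

```python
numbers = [1, 2]
letters = ['a', 'b', 'c']

# Wrapping zip() in list() shows the pairs it produces;
# 'c' is dropped because numbers runs out first
pairs = list(zip(numbers, letters))
print(pairs)

# zip() also accepts more than two iterables
triples = list(zip([1, 2], ['a', 'b'], [True, False]))
print(triples)
```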

Clicker Question #1¶

Given the following code, what will it print out?

data = np.array([[1, 2, 3, 4],
                 [5, 6, 7, 8]])
 
output = []
for d1, d2 in zip(data[0, :], data[1, :]):
    output.append(d1 + d2)

print(output)

A) [1, 2, 3, 4]
B) [1, 2, 3, 4, 5, 6, 7, 8]
C) [6, 8, 10, 12]
D) [10, 26]
E) [36]

Note that if you find yourself looping over arrays…there is probably a better way.

data.sum()

data.sum(axis=0)
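As a sketch of the "better way", the loop from the clicker question can be replaced by a single array operation. The `axis` argument controls which dimension is collapsed.

```python
import numpy as np

data = np.array([[1, 2, 3, 4],
                 [5, 6, 7, 8]])

total = data.sum()              # sum of every element
column_sums = data.sum(axis=0)  # collapse the rows: one sum per column
row_sums = data.sum(axis=1)     # collapse the columns: one sum per row

# Adding the two rows directly gives the same result as the zip() loop
row_added = data[0, :] + data[1, :]

print(total)
print(column_sums)
print(row_sums)
print(row_added)
```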

Heterogeneous Data¶

have continuous (numeric) and categorical (discrete) data
different data types need to be stored
uses a DataFrame object (think: spreadsheet)
allows for column and row labels

pandas¶

import pandas as pd

# Create some example heterogeneous data
d1 = {'Subj_ID': '001', 'score': 16, 'group' : 2, 'condition': 'cognition'}
d2 = {'Subj_ID': '002', 'score': 22, 'group' : 1, 'condition': 'perception'}

# Create a dataframe 
df = pd.DataFrame([d1, d2], index=[0, 1])

# Check out the dataframe
df

	Subj_ID	condition	group	score
0	001	cognition	2	16
1	002	perception	1	22

# You can index in pandas
df['condition']

# You can also select a row by its label with .loc
df.loc[0,:]
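A few more ways to pull data out of the same DataFrame, as a sketch. The variable names (`conditions`, `high_scores`, and so on) are illustrative.

```python
import pandas as pd

d1 = {'Subj_ID': '001', 'score': 16, 'group': 2, 'condition': 'cognition'}
d2 = {'Subj_ID': '002', 'score': 22, 'group': 1, 'condition': 'perception'}
df = pd.DataFrame([d1, d2], index=[0, 1])

# Square brackets with a column name select a column
conditions = df['condition']

# .loc selects by label: row label first, then column label
first_score = df.loc[0, 'score']

# .iloc selects by integer position instead of label
second_row = df.iloc[1]

# A boolean condition keeps only the rows where it is True
high_scores = df[df['score'] > 20]

print(list(conditions))
print(first_score)
print(second_row['Subj_ID'])
print(high_scores)
```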

Working with DataFrames¶

df.describe()

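# Take the average of all numeric columns
# (numeric_only=True skips the text columns, which pandas 2.0+ would otherwise fail on)
df.mean(numeric_only=True)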
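A short sketch of averaging with the example DataFrame. Because the table mixes text and numbers, the average is restricted to numeric columns here; the grouping step at the end is an extra illustration, not something from the lecture.

```python
import pandas as pd

d1 = {'Subj_ID': '001', 'score': 16, 'group': 2, 'condition': 'cognition'}
d2 = {'Subj_ID': '002', 'score': 22, 'group': 1, 'condition': 'perception'}
df = pd.DataFrame([d1, d2], index=[0, 1])

# Average every numeric column, skipping the text columns
means = df.mean(numeric_only=True)

# Average a single column
score_mean = df['score'].mean()

# Average the score separately for each condition
by_condition = df.groupby('condition')['score'].mean()

print(means)
print(score_mean)
print(by_condition)
```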

Clicker Question #2¶

Comparing them to standard library Python types, which is the best mapping for these new data types?

A) DataFrames are like lists, arrays are like tuples
B) DataFrames and arrays are like lists
C) DataFrames are like tuples, arrays are like lists
D) DataFrames and arrays are like dictionaries
E) DataFrames are like dictionaries, arrays are like lists

Plotting¶

%matplotlib inline

import matplotlib.pyplot as plt

# Create some data
dat = np.array([1, 2, 4, 8, 16, 32])

# Plot the data
plt.plot(dat);

[Figure: line plot of dat]

can change plot type
lots of customizations possible
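To give a feel for the two points above, here is a sketch that customizes the line plot and then draws the same data as a bar chart. The labels, title, and styling choices are arbitrary examples, and the "Agg" backend is selected only so the code also runs as a plain script outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np

dat = np.array([1, 2, 4, 8, 16, 32])

# Customize the line plot: markers, line style, labels, title, legend
fig, ax = plt.subplots()
ax.plot(dat, marker='o', linestyle='--', label='doubling')
ax.set_xlabel('index')
ax.set_ylabel('value')
ax.set_title('Powers of two')
ax.legend()

# Change the plot type: the same data as a bar chart
fig2, ax2 = plt.subplots()
ax2.bar(np.arange(len(dat)), dat)
```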

Analysis¶

scipy - statistical analysis
sklearn - machine learning

import scipy as sp
from scipy import stats

# Simulate some data
d1 = stats.norm.rvs(loc=0, size=1000)
d2 = stats.norm.rvs(loc=0.5, size=1000)

Analysis - Plotting the Data¶

# Plot the data
plt.hist(d1, 25, alpha=0.6);
plt.hist(d2, 25, alpha=0.6);

[Figure: overlaid histograms of d1 and d2]

Analysis - Statistical Comparisons¶

# Statistically compare the two distributions
stats.ttest_ind(d1, d2)

Ttest_indResult(statistic=-9.33809588164776, pvalue=2.5256641524949454e-20)
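The result of the test can be unpacked and used in code. The sketch below repeats the simulation with a fixed random seed so that it gives the same numbers each time it runs; the seed value and the 0.05 threshold are assumptions for illustration, so the values will differ from the output shown above.

```python
import numpy as np
from scipy import stats

# A seeded random generator makes the simulation repeatable
rng = np.random.default_rng(0)

d1 = stats.norm.rvs(loc=0, size=1000, random_state=rng)
d2 = stats.norm.rvs(loc=0.5, size=1000, random_state=rng)

# The result carries the test statistic and the p-value as attributes
result = stats.ttest_ind(d1, d2)
t_stat = result.statistic
p_value = result.pvalue

# Compare the p-value against a chosen threshold
significant = p_value < 0.05

print(t_stat, p_value, significant)
```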

COGS108: Data Science in Practice¶

If you are interested in data science and scientific computing in Python, consider taking COGS108: https://github.com/COGS108/
