Introduction to NumPy
Contents
Introduction to NumPy¶
Disclaimer: Most of the content in this notebook is coming from www.scipy-lectures.org
NumPy¶
NumPy is the fundamental package for scientific computing with Python. It is the basic building block of most data analysis in Python and contains highly optimized routines for creating and manipulating arrays.
Everything revolves around numpy arrays¶
Scipy
adds a bunch of useful science and engineering routines that operate on numpy arrays. E.g. signal processing, statistical distributions, image analysis, etc.pandas
adds powerful methods for manipulating numpy arrays. Like data frames in R - but typically faster.scikit-learn
supports state-of-the-art machine learning over numpy arrays. Inputs and outputs of virtually all functions are numpy arrays.If you want many more short exercises than the ones in this notebook - you can find 100 of them here
NumPy arrays vs Python arrays¶
NumPy arrays look very similar to Pyhon arrays.
import numpy as np
a = np.array([0, 1, 2, 3])
a
array([0, 1, 2, 3])
So, why is this useful? NumPy arrays are memory-efficient containers that provide fast numerical operations. We can show this very quickly by running a simple computation on the same two arrays.
L = range(1000)
# Computing the power of the first 1000 numbers with Python arrays
%timeit [i**2 for i in L]
187 µs ± 11.8 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
a = np.array(L)
# Computing the power of the first 1000 numbers with Numpy arrays
%timeit a**2
1.28 µs ± 38.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Creating arrays¶
Manual construction of arrays¶
You can create NumPy arrays manually almost in the same way as in Python in general.
2-D, 3-D, …¶
b = np.array([[0, 1, 2], [3, 4, 5]]) # 2 x 3 array
b
array([[0, 1, 2],
[3, 4, 5]])
b.shape
(2, 3)
c = np.array([[[1], [2]], [[3], [4]]])
c
array([[[1],
[2]],
[[3],
[4]]])
c.shape
(2, 2, 1)
Functions for creating arrays¶
In practice, we rarely enter items one by one. Therefore, NumPy offers many different helper functions.
Evenly spaced¶
a = np.arange(10) # 0 .. n-1 (!)
a
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
b = np.arange(1, 9, 2) # start, end (exclusive), step
b
array([1, 3, 5, 7])
… or by number of points¶
c = np.linspace(0, 1, 6) # start, end, num-points
c
array([0. , 0.2, 0.4, 0.6, 0.8, 1. ])
d = np.linspace(0, 1, 5, endpoint=False)
d
array([0. , 0.2, 0.4, 0.6, 0.8])
Common arrays¶
a = np.ones((3, 3)) # reminder: (3, 3) is a tuple
a
array([[1., 1., 1.],
[1., 1., 1.],
[1., 1., 1.]])
b = np.zeros((2, 2))
b
array([[0., 0.],
[0., 0.]])
c = np.eye(3)
c
array([[1., 0., 0.],
[0., 1., 0.],
[0., 0., 1.]])
d = np.diag(np.array([1, 2, 3, 4]))
d
array([[1, 0, 0, 0],
[0, 2, 0, 0],
[0, 0, 3, 0],
[0, 0, 0, 4]])
np.random
: random numbers (Mersenne Twister PRNG)¶
a = np.random.rand(4) # uniform in [0, 1]
a
array([0.72812914, 0.2888934 , 0.16006505, 0.57169499])
b = np.random.randn(4) # Gaussian
b
array([-1.06432023, 0.56855512, -0.8431078 , 0.01674151])
np.random.seed(1234) # Setting the random seed
Basic data types¶
You may have noticed that, in some instances, array elements are displayed with a trailing dot (e.g. 2.
vs 2
). This is due to a difference in the data-type used:
a = np.array([1, 2, 3])
a.dtype
dtype('int64')
b = np.array([1., 2., 3.])
b.dtype
dtype('float64')
Different data-types allow us to store data more compactly in memory, but most of the time we simply work with floating point numbers. Note that, in the example above, NumPy auto-detects the data-type from the input.
You can explicitly specify which data-type you want:
c = np.array([1, 2, 3], dtype=float)
c.dtype
dtype('float64')
The default data type is floating point:
a = np.ones((3, 3))
a.dtype
dtype('float64')
There are also other types:
# Complex
d = np.array([1+2j, 3+4j, 5+6*1j])
d.dtype
dtype('complex128')
# Bool
e = np.array([True, False, False, True])
e.dtype
dtype('bool')
# Strings
f = np.array(['Bonjour', 'Hello', 'Hallo',])
f.dtype # <--- strings containing max. 7 letters
dtype('<U7')
And much more…
int32
int64
uint32
uint64
Indexing and slicing¶
The items of an array can be accessed and assigned to the same way as other Python sequences (e.g. lists):
a = np.arange(10)
a
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
a[0], a[2], a[-1]
(0, 2, 9)
Warning: Indices begin at 0, like other Python sequences (and C/C++). In contrast, in Fortran or Matlab, indices begin at 1.
The usual python idiom for reversing a sequence is supported:
a[::-1]
array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])
For multidimensional arrays, indexes are tuples of integers:
a = np.diag(np.arange(3))
a
array([[0, 0, 0],
[0, 1, 0],
[0, 0, 2]])
a[1, 1]
1
a[2, 1] = 10 # third line, second column
a
array([[ 0, 0, 0],
[ 0, 1, 0],
[ 0, 10, 2]])
a[1]
array([0, 1, 0])
Note¶
In 2D, the first dimension corresponds to rows, the second to columns.
For multidimensional
a
,a[0]
is interpreted by taking all elements in the unspecified dimensions.
Slicing: Arrays, like other Python sequences can also be sliced¶
a = np.arange(10)
a
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
a[2:9:3] # [start:end:step]
array([2, 5, 8])
Note that the last index is not included!
a[:4]
array([0, 1, 2, 3])
All three slice components are not required: by default, start
is 0,
end
is the last and step
is 1:
a[1:3]
array([1, 2])
a[::2]
array([0, 2, 4, 6, 8])
a[3:]
array([3, 4, 5, 6, 7, 8, 9])
A small illustrated summary of NumPy indexing and slicing…
You can also combine assignment and slicing:
a = np.arange(10)
a[5:] = 10
a
array([ 0, 1, 2, 3, 4, 10, 10, 10, 10, 10])
b = np.arange(5)
a[5:] = b[::-1]
a
array([0, 1, 2, 3, 4, 4, 3, 2, 1, 0])
Fancy indexing¶
NumPy arrays can be indexed with slices, but also with boolean or integer arrays (masks). This method is called fancy indexing. It creates copies not views.
Using boolean masks¶
np.random.seed(3)
a = np.random.randint(0, 21, 15)
a
array([10, 3, 8, 0, 19, 10, 11, 9, 10, 6, 0, 20, 12, 7, 14])
(a % 3 == 0)
array([False, True, False, True, False, False, False, True, False,
True, True, False, True, False, False])
mask = (a % 3 == 0)
extract_from_a = a[mask] # or, a[a%3==0]
extract_from_a # extract a sub-array with the mask
array([ 3, 0, 9, 6, 0, 12])
Indexing with a mask can be very useful to assign a new value to a sub-array:
a[a % 3 == 0] = -1
a
array([10, -1, 8, -1, 19, 10, 11, -1, 10, -1, -1, 20, -1, 7, 14])
Indexing with an array of integers¶
a = np.arange(0, 100, 10)
a
array([ 0, 10, 20, 30, 40, 50, 60, 70, 80, 90])
Indexing can be done with an array of integers, where the same index is repeated several time:
a[[2, 3, 2, 4, 2]] # note: [2, 3, 2, 4, 2] is a Python list
array([20, 30, 20, 40, 20])
New values can be assigned with this kind of indexing:
a[[9, 7]] = -100
a
array([ 0, 10, 20, 30, 40, 50, 60, -100, 80, -100])
The image below illustrates various fancy indexing applications:
Elementwise operations¶
NumPy provides many elementwise operations that are much quicker than comparable list comprehension in plain Python.
Basic operations¶
With scalars:
a = np.array([1, 2, 3, 4])
a + 1
array([2, 3, 4, 5])
2**a
array([ 2, 4, 8, 16])
All arithmetic operates elementwise:
b = np.ones(4) + 1
a - b
array([-1., 0., 1., 2.])
a * b
array([2., 4., 6., 8.])
j = np.arange(5)
2**(j + 1) - j
array([ 2, 3, 6, 13, 28])
Array multiplication is not matrix multiplication¶
c = np.ones((3, 3))
c * c # NOT matrix multiplication!
array([[1., 1., 1.],
[1., 1., 1.],
[1., 1., 1.]])
# Matrix multiplication
c.dot(c)
array([[3., 3., 3.],
[3., 3., 3.],
[3., 3., 3.]])
Other operations¶
Comparisons¶
a = np.array([1, 2, 3, 4])
b = np.array([4, 2, 2, 4])
a == b
array([False, True, False, True])
a > b
array([False, False, True, False])
Array-wise comparisons:
a = np.array([1, 2, 3, 4])
b = np.array([4, 2, 2, 4])
c = np.array([1, 2, 3, 4])
np.array_equal(a, b)
False
np.array_equal(a, c)
True
Logical operations¶
a = np.array([1, 1, 0, 0], dtype=bool)
b = np.array([1, 0, 1, 0], dtype=bool)
np.logical_or(a, b)
array([ True, True, True, False])
np.logical_and(a, b)
array([ True, False, False, False])
Transcendental functions¶
a = np.arange(5)
np.sin(a)
array([ 0. , 0.84147098, 0.90929743, 0.14112001, -0.7568025 ])
np.log(a)
/opt/miniconda-latest/envs/neuro/lib/python3.7/site-packages/ipykernel_launcher.py:1: RuntimeWarning: divide by zero encountered in log
"""Entry point for launching an IPython kernel.
array([ -inf, 0. , 0.69314718, 1.09861229, 1.38629436])
np.exp(a)
array([ 1. , 2.71828183, 7.3890561 , 20.08553692, 54.59815003])
Shape mismatches¶
a = np.arange(4)
# NBVAL_SKIP
a + np.array([1, 2])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-67-d870f1bba8f8> in <module>
1 # NBVAL_SKIP
----> 2 a + np.array([1, 2])
ValueError: operands could not be broadcast together with shapes (4,) (2,)
Broadcasting? We’ll return to that later.
Transposition¶
a = np.triu(np.ones((3, 3)), 1)
a
array([[0., 1., 1.],
[0., 0., 1.],
[0., 0., 0.]])
a.T
array([[0., 0., 0.],
[1., 0., 0.],
[1., 1., 0.]])
The transposition is a view¶
As a result, the following code is wrong and will not make a matrix symmetric:
>>> a += a.T
It will work for small arrays (because of buffering) but fail for large one, in unpredictable ways.
Basic reductions¶
NumPy offers many quick functions to compute things like sum, mean, max etc.
Computing sums¶
x = np.array([1, 2, 3, 4])
np.sum(x)
10
Note: Certain NumPy functions can be also written at the end of an Numpy array.
x.sum()
10
Sum by rows and by columns:
x = np.array([[1, 1], [2, 2]])
x
array([[1, 1],
[2, 2]])
x.sum(axis=0) # columns (first dimension)
array([3, 3])
x.sum(axis=1) # rows (second dimension)
array([2, 4])
Other reductions¶
Like, mean
, std
, cumsum
etc. works the same way (and take axis=
).
Extrema¶
x = np.array([1, 3, 2])
x.min()
1
x.max()
3
x.argmin() # index of minimum
0
x.argmax() # index of maximum
1
Logical operations¶
np.all([True, True, False])
False
np.any([True, True, False])
True
Can be used for array comparisons:
a = np.zeros((100, 100))
np.any(a != 0)
False
np.all(a == a)
True
a = np.array([1, 2, 3, 2])
b = np.array([2, 2, 3, 2])
c = np.array([6, 4, 4, 5])
((a <= b) & (b <= c)).all()
True
Statistics¶
x = np.array([1, 2, 3, 1])
y = np.array([[1, 2, 3], [5, 6, 1]])
x.mean()
1.75
np.median(x)
1.5
np.median(y, axis=-1) # last axis
array([2., 5.])
x.std() # full population standard dev.
0.82915619758885
… and many more (best to learn as you go).
Broadcasting¶
Basic operations on
numpy
arrays (addition, etc.) are elementwiseThis works on arrays of the same size. Nevertheless, It’s also possible to do operations on arrays of different sizes if NumPy can transform these arrays so that they all have the same size: this conversion is called broadcasting.
The image below gives an example of broadcasting:
Let’s verify this:
a = np.tile(np.arange(0, 40, 10), (3, 1)).T
a
array([[ 0, 0, 0],
[10, 10, 10],
[20, 20, 20],
[30, 30, 30]])
b = np.array([0, 1, 2])
a + b
array([[ 0, 1, 2],
[10, 11, 12],
[20, 21, 22],
[30, 31, 32]])
We have already used broadcasting without knowing it!
a = np.ones((4, 5))
a[0] = 2 # we assign an array of dimension 0 to an array of dimension 1
a
array([[2., 2., 2., 2., 2.],
[1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1.]])
An useful trick:
a = np.arange(0, 40, 10)
a.shape
(4,)
a = a[:, np.newaxis] # adds a new axis -> 2D array
a.shape
a
array([[ 0],
[10],
[20],
[30]])
a + b
array([[ 0, 1, 2],
[10, 11, 12],
[20, 21, 22],
[30, 31, 32]])
Broadcasting seems a bit magical, but it is actually quite natural to use it when we want to solve a problem whose output data is an array with more dimensions than input data.
A lot of grid-based or network-based problems can also use broadcasting. For instance, if we want to compute the distance from the origin of points on a 10x10 grid, we can do:
x, y = np.arange(5), np.arange(5)[:, None]
distance = np.sqrt(x ** 2 + y ** 2)
distance
array([[0. , 1. , 2. , 3. , 4. ],
[1. , 1.41421356, 2.23606798, 3.16227766, 4.12310563],
[2. , 2.23606798, 2.82842712, 3.60555128, 4.47213595],
[3. , 3.16227766, 3.60555128, 4.24264069, 5. ],
[4. , 4.12310563, 4.47213595, 5. , 5.65685425]])
Or in color:
Array shape manipulation¶
Sometimes your arrays don’t have the right shape. Also for this, NumPy has many solutions.
Flattening¶
a = np.array([[1, 2, 3], [4, 5, 6]])
a
array([[1, 2, 3],
[4, 5, 6]])
a.ravel()
array([1, 2, 3, 4, 5, 6])
a.T
array([[1, 4],
[2, 5],
[3, 6]])
a.T.ravel()
array([1, 4, 2, 5, 3, 6])
Higher dimensions: last dimensions ravel out “first”.
Reshaping¶
The inverse operation to flattening:
a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
a
array([[ 1, 2, 3, 4],
[ 5, 6, 7, 8],
[ 9, 10, 11, 12]])
a.reshape(6, 2)
array([[ 1, 2],
[ 3, 4],
[ 5, 6],
[ 7, 8],
[ 9, 10],
[11, 12]])
Or,
a.reshape((6, -1)) # unspecified (-1) value is inferred
array([[ 1, 2],
[ 3, 4],
[ 5, 6],
[ 7, 8],
[ 9, 10],
[11, 12]])
Adding a dimension¶
Indexing with the np.newaxis
or None
object allows us to add an axis to an array (you have seen this already above in the broadcasting section):
z = np.array([1, 2, 3])
z
array([1, 2, 3])
z[:, np.newaxis]
array([[1],
[2],
[3]])
z[np.newaxis, :]
array([[1, 2, 3]])
Dimension shuffling¶
a = np.arange(4*3*2).reshape(4, 3, 2)
a
array([[[ 0, 1],
[ 2, 3],
[ 4, 5]],
[[ 6, 7],
[ 8, 9],
[10, 11]],
[[12, 13],
[14, 15],
[16, 17]],
[[18, 19],
[20, 21],
[22, 23]]])
a.shape
(4, 3, 2)
b = a.transpose(1, 2, 0)
b
array([[[ 0, 6, 12, 18],
[ 1, 7, 13, 19]],
[[ 2, 8, 14, 20],
[ 3, 9, 15, 21]],
[[ 4, 10, 16, 22],
[ 5, 11, 17, 23]]])
b.shape
(3, 2, 4)
Resizing¶
Size of an array can be changed with ndarray.resize
:
a = np.arange(4)
a.resize((8,))
a
array([0, 1, 2, 3, 0, 0, 0, 0])
Sorting data¶
Sorting along an axis:
a = np.array([[4, 3, 5], [1, 2, 1]])
b = np.sort(a, axis=1)
b
array([[3, 4, 5],
[1, 1, 2]])
Important: Note that the code above sorts each row separately!
In-place sort:
a.sort(axis=1)
a
array([[3, 4, 5],
[1, 1, 2]])
Sorting with fancy indexing:
a = np.array([4, 3, 1, 2])
j = np.argsort(a)
j
array([2, 3, 1, 0])
a[j]
array([1, 2, 3, 4])
Finding minima and maxima:
a = np.array([4, 3, 1, 2])
j_max = np.argmax(a)
j_min = np.argmin(a)
j_max, j_min
(0, 2)
npy
- NumPy’s own data format¶
NumPy has its own binary format, not portable but with efficient I/O:
data = np.ones((3, 3))
np.save('pop.npy', data)
data3 = np.load('pop.npy')
Summary - What do you need to know to get started?¶
Know how to create arrays :
array
,arange
,ones
,zeros
.Know the shape of the array with
array.shape
, then use slicing to obtain different views of the array:array[::2]
, etc. Adjust the shape of the array usingreshape
or flatten it withravel
.Obtain a subset of the elements of an array and/or modify their values with masks
a[a < 0] = 0
Know miscellaneous operations on arrays, such as finding the mean or max (
array.max()
,array.mean()
).For advanced use: master the indexing with arrays of integers, as well as broadcasting.