B12. Introduction to Pandas
[1]:
# Colab setup ------------------
import os, sys
if "google.colab" in sys.modules:
    data_path = "https://biocircuits.github.io/chapters/data/"
else:
    data_path = "data/"
# ------------------------------

import numpy as np

# Pandas, conventionally imported as pd
import pandas as pd
Pandas is the primary tool in the Python ecosystem for handling data. Its primary object, the DataFrame, is extremely useful for loading and wrangling data. In this book, we primarily use Pandas for loading in and accessing data sets, which is what we describe here. Pandas’s functionality extends far beyond these simple applications.
The data set
We will use Pandas to explore the data set from the chapter on noise, acquired by Elowitz and coworkers (Elowitz et al., 2002). The data are available in this CSV file.
You will notice in the first cell of this notebook that we set a different path to where data sets are located, stored in the data_path variable. If you are running this notebook on Colab, the path to the data set goes directly to the URL where it is hosted. If you are running locally on your own machine, be sure to download the CSV file and move it to a data directory within your current directory.
To access the file, we can use the convenient os.path.join() function to build the full path.
[2]:
os.path.join(data_path, "elowitz_et_al_2002_fig_3a.csv")
[2]:
'data/elowitz_et_al_2002_fig_3a.csv'
We will use this convenient way of specifying paths when we load in data files.
The data set contains normalized fluorescence levels and measured noise for collections of bacterial cells in separate experiments featuring varying levels of inducer.
Let’s look at the first few lines of the file.
[3]:
with open(os.path.join(data_path, "elowitz_et_al_2002_fig_3a.csv"), "r") as f:
    for i, line in enumerate(f):
        if i < 10:
            print(line.rstrip())
strain,CFP,YFP
m22,2438,1409
m22,2316,1391
m22,2521,1511
m22,2646,1460
m22,2830,1638
m22,2567,1572
m22,2636,1450
m22,2644,1335
m22,2897,1689
The first line contains the headers for each column. In this case, the first column contains the strain, the second the intensity of the CFP channel, and the third the intensity of the YFP channel.
To load in the data in this file, Pandas has a very powerful function, pd.read_csv(), that can read a CSV file and store the contents in a convenient data structure called a data frame. In Pandas, the data type for a data frame is DataFrame, and we will use “data frame” and “DataFrame” interchangeably.
Reading in data
Take a look at the doc string of pd.read_csv(). Holy cow! There are so many options we can specify for reading in a CSV file. You will likely find reasons to use many of these throughout your research. We do not really need to specify keyword arguments for this data set, fortunately. So, let’s proceed to load in the data set.
[4]:
df = pd.read_csv(os.path.join(data_path, "elowitz_et_al_2002_fig_3a.csv"))
# Check the type
type(df)
[4]:
pandas.core.frame.DataFrame
We now have the data stored in a data frame. We can look at it in the Jupyter notebook, since Jupyter will display it in a well-organized, pretty way.
[5]:
df
[5]:
strain | CFP | YFP | |
---|---|---|---|
0 | m22 | 2438 | 1409 |
1 | m22 | 2316 | 1391 |
2 | m22 | 2521 | 1511 |
3 | m22 | 2646 | 1460 |
4 | m22 | 2830 | 1638 |
... | ... | ... | ... |
529 | d22 | 2249 | 1561 |
530 | d22 | 2241 | 1673 |
531 | d22 | 2345 | 1659 |
532 | d22 | 2776 | 1960 |
533 | d22 | 2114 | 1555 |
534 rows × 3 columns
This is a nice representation of the data, but we really do not need to display that many rows of the data frame in order to understand its structure. Instead, we can use the head() method of data frames to look at the first few rows.
[6]:
df.head()
[6]:
strain | CFP | YFP | |
---|---|---|---|
0 | m22 | 2438 | 1409 |
1 | m22 | 2316 | 1391 |
2 | m22 | 2521 | 1511 |
3 | m22 | 2646 | 1460 |
4 | m22 | 2830 | 1638 |
This is more manageable and gives us an overview of what the columns are.
Indexing data frames
The data frame is a convenient data structure for many reasons that will become clear as we start exploring. Let’s start by looking at how data frames are indexed. Let’s try to look at the first row.
[7]:
df[0]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
File ~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/indexes/base.py:3802, in Index.get_loc(self, key, method, tolerance)
3801 try:
-> 3802 return self._engine.get_loc(casted_key)
3803 except KeyError as err:
File ~/opt/anaconda3/lib/python3.9/site-packages/pandas/_libs/index.pyx:138, in pandas._libs.index.IndexEngine.get_loc()
File ~/opt/anaconda3/lib/python3.9/site-packages/pandas/_libs/index.pyx:165, in pandas._libs.index.IndexEngine.get_loc()
File pandas/_libs/hashtable_class_helper.pxi:5745, in pandas._libs.hashtable.PyObjectHashTable.get_item()
File pandas/_libs/hashtable_class_helper.pxi:5753, in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 0
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
Cell In[7], line 1
----> 1 df[0]
File ~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/frame.py:3807, in DataFrame.__getitem__(self, key)
3805 if self.columns.nlevels > 1:
3806 return self._getitem_multilevel(key)
-> 3807 indexer = self.columns.get_loc(key)
3808 if is_integer(indexer):
3809 indexer = [indexer]
File ~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/indexes/base.py:3804, in Index.get_loc(self, key, method, tolerance)
3802 return self._engine.get_loc(casted_key)
3803 except KeyError as err:
-> 3804 raise KeyError(key) from err
3805 except TypeError:
3806 # If we have a listlike key, _check_indexing_error will raise
3807 # InvalidIndexError. Otherwise we fall through and re-raise
3808 # the TypeError.
3809 self._check_indexing_error(key)
KeyError: 0
Yikes! Lots of errors. The problem is that we tried to index numerically by row. We index DataFrames by columns, and there is no column named 0 in this data frame, though there could be. Instead, we might want to look at the column with the CFP intensity.
[ ]:
df["CFP"]
This gave us the numbers we were after. Notice that when it was printed, the index of the rows came along with it. If we want to pull out a fluorescence value, say the one corresponding to index 4, we can do that.
[ ]:
df["CFP"][4]
However, this is not the preferred way to do this. It is better to use .loc, which gives the location in the data frame we want.
[ ]:
df.loc[4, "CFP"]
Note that following .loc, we have the index by row then column, separated by a comma, in brackets. It is also important to note that row indices need not be integers, and you should not count on them being integers. In practice you will almost never use row indices, but rather use Boolean indexing.
Boolean indexing of data frames
Let’s say I wanted the CFP fluorescence intensity only for the m22 strain. I can use Boolean indexing to specify the rows of interest. Specifically, I want the rows for which df['strain'] == 'm22'. You can essentially plop this syntax directly into .loc.
[ ]:
df.loc[df["strain"] == "m22", "CFP"]
We can extend this to ask for even more specific records, like the record for each bacterium from the m22 strain with CFP fluorescence above 2500 and YFP fluorescence below 1200. We can again use Boolean indexing, but we need to use an & operator. We did not cover this bitwise operator before, but the syntax is self-explanatory in the example below. Note that it is important that each Boolean comparison is in parentheses because of the precedence of the operators involved.
[ ]:
df.loc[
(df["strain"] == "m22") & (df["CFP"] > 2500) & (df["YFP"] < 1200),
:,
]
There were only two records. The syntax to accomplish this is a bit cumbersome. The code is clearer if we set up our Boolean indexing first, as follows.
[ ]:
inds = (df["strain"] == "m22") & (df["CFP"] > 2500) & (df["YFP"] < 1200)
df.loc[inds, :]
Note that if we want all columns, we can use : for the column index. Similarly, if we want all rows, we use : for the row index.
Notice also that inds is an array (actually a Pandas Series, essentially a DataFrame with one column) of Trues and Falses. When we index with it using .loc, we get back the rows where inds is True.
Of note in this exercise in Boolean indexing is that we never had to write a loop. To produce our indices explicitly, we could instead have done the following.
[ ]:
# Initialize array of Boolean indices
inds = [False] * len(df)

# Iterate over the rows of the DataFrame to check if the row should be included
for i, r in df.iterrows():
    if r["strain"] == "m22" and r["CFP"] > 2500 and r["YFP"] < 1200:
        inds[i] = True

# Make our selection with Boolean indexing
df.loc[inds, :]
This feature, where the looping is done automatically on Pandas objects like data frames, is very powerful and saves us writing lots of lines of code. This example also showed how to use the iterrows() method of a data frame to iterate over its rows. It is actually rare that you will need to do that, as we’ll show next when computing with data frames.
Calculating with data frames
We can do elementwise calculations on elements of a data frame just like we could with NumPy arrays.
[ ]:
# Do silly, arbitrary calculations
df["CFP divided by 1000"] = df["CFP"] / 1000
df["CFP + YFP"] = df["CFP"] + df["YFP"]
# Take a look
df.head()
We can also apply NumPy functions, e.g. to compute the mean CFP fluorescence (irrespective of strain).
[ ]:
np.mean(df["CFP"])
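If instead we wanted the mean within each strain, Pandas’s groupby() method computes per-group statistics without any explicit looping. A minimal sketch with hypothetical values (df_demo and its numbers are made up for illustration):

```python
import pandas as pd

# Hypothetical miniature data set in the same shape as the real one
df_demo = pd.DataFrame(
    {
        "strain": ["m22", "m22", "d22", "d22"],
        "CFP": [2438.0, 2316.0, 2249.0, 2241.0],
    }
)

# Mean CFP fluorescence within each strain
mean_cfp = df_demo.groupby("strain")["CFP"].mean()
print(mean_cfp["m22"])  # 2377.0
print(mean_cfp["d22"])  # 2245.0
```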
Extracting data as NumPy arrays
We sometimes want to take a column or other subset of a data frame and have it available as a NumPy array (usually for performance reasons). The values attribute of a data frame allows for this. As an example, if we want to take out all of the CFP fluorescence values for the d22 strain as a NumPy array, we can do the following.
[ ]:
df.loc[df["strain"] == "d22", "CFP"].values
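In recent versions of Pandas, the to_numpy() method is the recommended alternative to the values attribute for this conversion. A sketch with a hypothetical miniature data frame (df_demo and its numbers are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature version of the data set
df_demo = pd.DataFrame(
    {"strain": ["d22", "m22", "d22"], "CFP": [2249, 2438, 2345]}
)

# to_numpy() returns the selected values as a NumPy array
cfp = df_demo.loc[df_demo["strain"] == "d22", "CFP"].to_numpy()
print(type(cfp))  # <class 'numpy.ndarray'>
print(cfp)        # [2249 2345]
```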
Computing environment
[ ]:
%load_ext watermark
%watermark -v -p numpy,pandas,jupyterlab