B12. Introduction to Pandas


[1]:
# Colab setup ------------------
import os, sys
if "google.colab" in sys.modules:
    data_path = "https://biocircuits.github.io/chapters/data/"
else:
    data_path = "data/"
# ------------------------------

import os
import numpy as np

# Pandas, conventionally imported as pd
import pandas as pd

Pandas is the primary tool in the Python ecosystem for handling data. Its primary object, the DataFrame, is extremely useful for loading and wrangling data. In this book, we primarily use Pandas for loading and accessing data sets, which is what we describe here. Pandas's functionality extends far beyond these simple applications.

The data set

We will explore Pandas using the data set from the chapter on noise, acquired by Elowitz and coworkers (Elowitz et al., 2002). The data are available in this CSV file.

You will notice in the first cell of this notebook that we set the path to where data sets are located, stored in the data_path variable. If you are running this notebook on Colab, the path points directly to the URL where the data set is hosted. If you are running locally on your own machine, be sure to download the CSV file and move it into a data directory within your current directory.

To build the full path to the file, we can use the convenient os.path.join() function.

[2]:
os.path.join(data_path, "elowitz_et_al_2002_fig_3a.csv")
[2]:
'data/elowitz_et_al_2002_fig_3a.csv'

We will use this convenient way of specifying paths when we load in data files.

The data set contains normalized fluorescence levels and measured noise for collections of bacterial cells in separate experiments featuring varying levels of inducer.

Let’s look at the first few lines of the file.

[3]:
with open(os.path.join(data_path, "elowitz_et_al_2002_fig_3a.csv"), "r") as f:
    for i, line in enumerate(f):
        if i < 10:
            print(line.rstrip())
strain,CFP,YFP
m22,2438,1409
m22,2316,1391
m22,2521,1511
m22,2646,1460
m22,2830,1638
m22,2567,1572
m22,2636,1450
m22,2644,1335
m22,2897,1689

The first line contains the headers for each column. In this case, the first column contains the strain, the second the intensity of the CFP channel, and the third the intensity of the YFP channel.

To load in the data in this file, Pandas has a very powerful function, pd.read_csv(), which can read in a CSV file and store the contents in a convenient data structure called a data frame. In Pandas, the data type for a data frame is DataFrame, and we will use “data frame” and “DataFrame” interchangeably.

Reading in data

Take a look at the doc string of pd.read_csv(). Holy cow! There are so many options we can specify for reading in a CSV file. You will likely find reasons to use many of these throughout your research. Fortunately, we do not really need to specify keyword arguments for this data set, so let’s proceed to load it in.

[4]:
df = pd.read_csv(os.path.join(data_path, "elowitz_et_al_2002_fig_3a.csv"))

# Check the type
type(df)
[4]:
pandas.core.frame.DataFrame

We now have the data stored in a data frame. We can look at it in the Jupyter notebook, since Jupyter will display it in a well-organized, pretty way.

[5]:
df
[5]:
strain CFP YFP
0 m22 2438 1409
1 m22 2316 1391
2 m22 2521 1511
3 m22 2646 1460
4 m22 2830 1638
... ... ... ...
529 d22 2249 1561
530 d22 2241 1673
531 d22 2345 1659
532 d22 2776 1960
533 d22 2114 1555

534 rows × 3 columns

This is a nice representation of the data, but we really do not need to display that many rows of the data frame in order to understand its structure. Instead, we can use the head() method of data frames to look at the first few rows.

[6]:
df.head()
[6]:
strain CFP YFP
0 m22 2438 1409
1 m22 2316 1391
2 m22 2521 1511
3 m22 2646 1460
4 m22 2830 1638

This is more manageable and gives us an overview of what the columns are.
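Beyond head(), data frames carry attributes that summarize their structure directly. As a quick illustration, here is a small stand-in data frame (hand-made for this sketch, not the real data set) whose shape and columns attributes report the dimensions and column names.

```python
import pandas as pd

# Small stand-in data frame mirroring the structure of the real data set
df_demo = pd.DataFrame(
    {
        "strain": ["m22", "m22", "d22"],
        "CFP": [2438, 2316, 2249],
        "YFP": [1409, 1391, 1561],
    }
)

# shape gives (number of rows, number of columns)
print(df_demo.shape)  # (3, 3)

# columns holds the column names
print(list(df_demo.columns))  # ['strain', 'CFP', 'YFP']
```

The same attributes work on the data frame we just loaded, where df.shape would report 534 rows and 3 columns.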

Indexing data frames

The data frame is a convenient data structure for many reasons that will become clear as we start exploring. Let’s start by looking at how data frames are indexed. Let’s try to look at the first row.

[7]:
df[0]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/indexes/base.py:3802, in Index.get_loc(self, key, method, tolerance)
   3801 try:
-> 3802     return self._engine.get_loc(casted_key)
   3803 except KeyError as err:

File ~/opt/anaconda3/lib/python3.9/site-packages/pandas/_libs/index.pyx:138, in pandas._libs.index.IndexEngine.get_loc()

File ~/opt/anaconda3/lib/python3.9/site-packages/pandas/_libs/index.pyx:165, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:5745, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:5753, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 0

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[7], line 1
----> 1 df[0]

File ~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/frame.py:3807, in DataFrame.__getitem__(self, key)
   3805 if self.columns.nlevels > 1:
   3806     return self._getitem_multilevel(key)
-> 3807 indexer = self.columns.get_loc(key)
   3808 if is_integer(indexer):
   3809     indexer = [indexer]

File ~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/indexes/base.py:3804, in Index.get_loc(self, key, method, tolerance)
   3802     return self._engine.get_loc(casted_key)
   3803 except KeyError as err:
-> 3804     raise KeyError(key) from err
   3805 except TypeError:
   3806     # If we have a listlike key, _check_indexing_error will raise
   3807     #  InvalidIndexError. Otherwise we fall through and re-raise
   3808     #  the TypeError.
   3809     self._check_indexing_error(key)

KeyError: 0

Yikes! Lots of errors. The problem is that we tried to index numerically by row. We index DataFrames by columns, and there is no column named 0 in this data frame, though there could be. Instead, we might want to look at the column with the CFP intensity.

[ ]:
df["CFP"]

This gave us the numbers we were after. Notice that when the column was printed, the row indices came along with it. If we want to pull out a fluorescence value, say the one corresponding to index 4, we can do that.

[ ]:
df["CFP"][4]

However, this is not the preferred way to do this. It is better to use .loc, which gives the location in the data frame we want.

[ ]:
df.loc[4, "CFP"]

Note that following .loc, we have the row index and then the column index, separated by a comma, in brackets. It is also important to note that row indices need not be integers, and you should not count on them being integers. In practice, you will almost never use row indices, but will rather use Boolean indexing.
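To see that row indices need not be integers, consider this hypothetical tiny data frame (made up for illustration) whose rows are labeled with strings. The .loc indexer works with labels, whatever their type.

```python
import pandas as pd

# Hypothetical data frame with string row labels instead of integers
df_labeled = pd.DataFrame(
    {"CFP": [2438, 2316], "YFP": [1409, 1391]},
    index=["cell_a", "cell_b"],
)

# .loc indexes by label, so the row index here is a string
df_labeled.loc["cell_b", "CFP"]  # 2316
```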

Boolean indexing of data frames

Let’s say I wanted the CFP fluorescence intensity only for the m22 strain. I can use Boolean indexing to specify the rows of interest. Specifically, I want the rows for which df['strain'] == 'm22'. You can place this expression directly in the row slot when using .loc.

[ ]:
df.loc[df["strain"] == "m22", "CFP"]

We can extend this to ask for even more specific records, like the record for each bacterium from the m22 strain with CFP fluorescence above 2500 and YFP fluorescence below 1200. We can again use Boolean indexing, but we need an & operator. We did not cover this bitwise operator before, but the syntax is self-explanatory in the example below. Note that each Boolean comparison must be in parentheses because of the precedence of the operators involved.

[ ]:
df.loc[
    (df["strain"] == "m22") & (df["CFP"] > 2500) & (df["YFP"] < 1200),
    :,
]

There were only two such records. The syntax to accomplish this is a bit cumbersome; the code is clearer if we set up our Boolean indexing first, as follows.

[ ]:
inds = (df["strain"] == "m22") & (df["CFP"] > 2500) & (df["YFP"] < 1200)

df.loc[inds, :]

Note that if we want all columns, we can use : for the column index. Similarly, if we want all rows, we use : for the row index.

Notice also that inds is an array (actually a Pandas Series, essentially a DataFrame with one column) of Trues and Falses. When we index with it using .loc, we get back rows where inds is True.
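Because a Boolean Series behaves like an array of ones and zeros, summing it counts the matching rows, and the ~ operator (not used elsewhere in this chapter) inverts it to select all other rows. A quick sketch with a small stand-in data frame, not the real data set:

```python
import pandas as pd

# Small stand-in data frame for illustration
df_demo = pd.DataFrame(
    {
        "strain": ["m22", "m22", "d22"],
        "CFP": [2600, 2400, 2700],
        "YFP": [1100, 1300, 1150],
    }
)

# Boolean Series: True only for m22 rows with CFP > 2500 and YFP < 1200
inds = (df_demo["strain"] == "m22") & (df_demo["CFP"] > 2500) & (df_demo["YFP"] < 1200)

# True/False behave like 1/0, so summing counts matching rows
print(inds.sum())  # 1

# ~ inverts the Boolean Series, selecting all *other* rows
print(len(df_demo.loc[~inds, :]))  # 2
```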

Of interest in this exercise in Boolean indexing is that we never had to write a loop. Had we not used Boolean indexing, producing our indices would have required a loop, as follows.

[ ]:
# Initialize array of Boolean indices
inds = [False] * len(df)

# Iterate over the rows of the DataFrame to check if the row should be included
for i, r in df.iterrows():
    if r["strain"] == "m22" and r["CFP"] > 2500 and r["YFP"] < 1200:
        inds[i] = True

# Make our selection with Boolean indexing
df.loc[inds, :]

This feature, where the looping is done automatically on Pandas objects like data frames, is very powerful and saves us writing lots of lines of code. This example also showed how to use the iterrows() method of a data frame to iterate over the rows of a data frame. It is actually rare that you will need to do that, as we’ll show next when computing with data frames.

Calculating with data frames

We can do elementwise calculations on elements of a data frame just like we could with NumPy arrays.

[ ]:
# Do silly, arbitrary calculations
df["CFP divided by 1000"] = df["CFP"] / 1000
df["CFP + YFP"] = df["CFP"] + df["YFP"]

# Take a look
df.head()

We can also apply NumPy functions, e.g. to compute the mean CFP fluorescence (irrespective of strain).

[ ]:
np.mean(df["CFP"])
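If we instead want the mean for a single strain, we can combine Boolean indexing with np.mean. A minimal sketch with a small stand-in data frame (the same pattern applies to the real df):

```python
import numpy as np
import pandas as pd

# Small stand-in data frame for illustration
df_demo = pd.DataFrame(
    {"strain": ["m22", "m22", "d22"], "CFP": [2400, 2600, 2000]}
)

# Mean CFP fluorescence for the m22 strain only
np.mean(df_demo.loc[df_demo["strain"] == "m22", "CFP"])  # 2500.0
```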

Extracting data as NumPy arrays

We sometimes want to take a column or other subset of a data frame and have it available as a NumPy array (usually for performance reasons). The values attribute of a data frame allows for this. As an example, if we want to take out all of the CFP fluorescence values for the d22 strain as a NumPy array, we can do the following.

[ ]:
df.loc[df["strain"] == "d22", "CFP"].values

Computing environment

[ ]:
%load_ext watermark
%watermark -v -p numpy,pandas,jupyterlab