1. Importing Data

One of the most magical things about R is its ability to handle tables of data using the data frame object (or tibble in the Tidyverse). The pandas library gives Python much the same functionality.

Recall that a data frame is an object consisting of rows and columns (just like a spreadsheet or database table). In addition to storing the data, the data frame object also has useful properties and methods that can be accessed through code. Thus, the first step in data analysis is to load data from a file into a data frame object.

1.1. Loading the Pandas library

A Python library contains things (objects, code) that are not part of the core Python language but are nonetheless useful to some community of users. Libraries save us from re-inventing the wheel: Once someone has created a library and made it available for download, we can use the contents of the library in our own Python programs.

For example, to use the pandas.DataFrame object in our programs, we must first import the Pandas library into our environment.

Start by creating a new notebook in Jupyter and entering the following in the first cell:

import pandas

It is now possible to reference pandas.DataFrame directly in my scripts, and Jupyter will know what I am talking about. Note, however, that programmers are lazy, and typing the word “pandas” over and over again is seen as an unnecessary burden. As such, programmers typically alias the library when they import it:

import pandas as pd

Here, “pd” is the alias for “pandas”. So I can write pd.DataFrame in my code rather than pandas.DataFrame. Not sure what I will do with all the time I save.

Finally, if I plan to use the DataFrame object frequently in my code, I can call it out by name using the from keyword:

from pandas import DataFrame

Now I can type DataFrame rather than pd.DataFrame.

1.2. Creating a DataFrame object

Python is an object-oriented programming language. You create an instance of an object by calling its constructor, typically with some arguments.
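
For example, a tiny data frame can be built by passing a Python dictionary to the DataFrame constructor (a made-up illustration, unrelated to the bank data below):

df = pd.DataFrame({'name': ['Ann', 'Bob'], 'salary': [40.0, 35.5]})

Each dictionary key becomes a column name and each list becomes that column’s values. In practice, though, we usually load data from a file instead of typing it in.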

To illustrate, let’s create a data frame called “bank” that contains data from a CSV file stored on my computer in the “Data” folder. To do this, we call the Pandas read_csv function:

bank = pd.read_csv('Data/Bank.csv')
bank.head()
   Employee  EducLev  JobGrade  YrHired  YrBorn  Gender  YrsPrior PCJob  Salary
0         1        3         1       92      69    Male         1    No    32.0
1         2        1         1       81      57  Female         1    No    39.1
2         3        1         1       83      60  Female         0    No    33.2
3         4        2         1       87      55  Female         7    No    30.6
4         5        3         1       92      67    Male         0    No    29.0

Note that the head(n) method can be called on the new DataFrame bank. It simply prints out the first n rows of the data frame (five by default, when no argument is given) so you can see whether the data imported okay.
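
For example, to see the first ten rows rather than the default five:

bank.head(10)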

Hint: Pointing to a file on your computer is always tricky when you don't have a pop-up file system chooser like you do in most Windows applications. Some basic hints:
  1. If practical, put the data file in the same folder as your Jupyter notebook. In the example above, I created a subfolder called "Data" and put the bank data there. This makes it easier for you to remember (and type) the path to the data file.
  2. If your data must reside somewhere else on your computer or network: Be very careful when typing. Cut and paste the path from Windows Explorer if required.
  3. Beware the slashes/backslashes. Unix-like operating systems separate directory names with a slash "/" whereas Microsoft Windows uses a backslash "\". In general, Python separates folders with forward slashes. The backslash is used in Python as the 'escape' character. If you cut and paste from Windows Explorer, you must either change all the backslashes to slashes or add the "r" prefix to your pathname to indicate a "raw" string of text (Python will ignore the special meaning of backslashes). So the following should work for a more complex file location:
    1. pd.read_csv('C:/Users/Michael Brydon/Data/Bank.csv') — all forward slashes
    2. pd.read_csv(r'C:\Users\Michael Brydon\Data/Bank.csv') — mixture of slashes and backslashes with the "r" prefix
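  4. Alternatively, the pathlib module in Python's standard library builds file paths in an operating-system-independent way, and recent versions of Pandas accept Path objects wherever a filename string is expected. A minimal sketch:
    from pathlib import Path
    bank = pd.read_csv(Path('Data') / 'Bank.csv')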

1.3. Creating a DataFrame object from an Excel file

The Pandas library supports importing from many different data sources, including CSV, Excel, and JSON files, and even direct database access over a network.

Here we use read_excel instead of read_csv. It is the same data, but in a different file format on my computer:

bankxl = pd.read_excel('Data/Bank.xlsx')
bankxl.head()
   Employee  EducLev  JobGrade  YrHired  YrBorn  Gender  YrsPrior PCJob  Salary
0         1        3         1       92      69    Male         1    No    32.0
1         2        1         1       81      57  Female         1    No    39.1
2         3        1         1       83      60  Female         0    No    33.2
3         4        2         1       87      55  Female         7    No    30.6
4         5        3         1       92      67    Male         0    No    29.0

Running head() after importing is a good idea just to make sure you imported what you think you imported. If not, you will have to either:

  • edit your source file (Excel) to make sure it has the necessary tabular structure (with the first row as headings, and so on)

  • add arguments to your read_... method to make sure you are starting at the correct row, reading the correct spreadsheet, and so on.
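
For example, if the data happened to start a few rows down on a particular worksheet, something like the following might be needed (the sheet name and row offset here are invented for illustration; sheet_name and skiprows are standard read_excel arguments):

bankxl = pd.read_excel('Data/Bank.xlsx', sheet_name='Sheet1', skiprows=2)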

Here is where Googling and learning Python go hand in hand. If worst comes to worst, read the documentation.

1.4. Confirming data frame properties

As noted previously, objects in Python expose useful properties and methods. For example, we can confirm the size of the bank data frame with the shape property, which gives us the number of rows (208) and columns (9):

bank.shape
(208, 9)

We can use the describe() method to generate some summary statistics:

bank.describe()
         Employee     EducLev    JobGrade     YrHired      YrBorn    YrsPrior      Salary
count  208.000000  208.000000  208.000000  208.000000  208.000000  208.000000  208.000000
mean   104.500000    3.158654    2.759615   85.326923   54.605769    2.375000   39.921923
std     60.188592    1.467464    1.566529    6.987832   10.318988    3.135237   11.256154
min      1.000000    1.000000    1.000000   56.000000   30.000000    0.000000   26.700000
25%     52.750000    2.000000    1.000000   82.000000   47.750000    0.000000   33.000000
50%    104.500000    3.000000    3.000000   87.000000   56.500000    1.000000   37.000000
75%    156.250000    5.000000    4.000000   90.000000   63.000000    4.000000   44.000000
max    208.000000    5.000000    6.000000   93.000000   73.000000   18.000000   97.000000

First, notice the difference between a property like shape with no parentheses and a method like describe() with parentheses for passing arguments.

Hint: Press the Shift-Tab key while within the parentheses of a Python method for a list of the possible arguments. Hit Shift-Tab twice to get more detailed help.
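
The same documentation is available programmatically: Python's built-in help(pd.read_csv) prints it, and in Jupyter, appending a question mark to a name (pd.read_csv?) pops it up.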

Second, by default, the describe() method only summarizes the numerical columns. Recall that our data frame consists of nine columns. To see the issue, run the info() method:

bank.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 208 entries, 0 to 207
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Employee  208 non-null    int64  
 1   EducLev   208 non-null    int64  
 2   JobGrade  208 non-null    int64  
 3   YrHired   208 non-null    int64  
 4   YrBorn    208 non-null    int64  
 5   Gender    208 non-null    object 
 6   YrsPrior  208 non-null    int64  
 7   PCJob     208 non-null    object 
 8   Salary    208 non-null    float64
dtypes: float64(1), int64(6), object(2)
memory usage: 14.8+ KB

Here we see the nine columns in the data frame listed by data type. Six hold integer values, one (Salary) holds floating point numbers, and two (Gender, PCJob) are “object” columns (strings).
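
If all we want is the mapping from column names to data types, the dtypes property gives a more compact view:

bank.dtypes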

1.5. Converting strings to categories

Recall in R that we made a distinction between strings (character data) like a person’s name and factors, which look like strings but have a limited set of repeated values. Color is a factor, for example, because it takes on a few known values like “red”, “green”, and “blue”. Similarly, a column called car_brand would be a factor because it would contain a finite number of string values, such as “Ford”, “Tesla”, and “Ferrari”.

Pandas permits the same distinction but uses the term category instead of factor. This is telling: R uses the old statistical technical term “factor” whereas Pandas/Python uses the more straightforward term “category”. This is the difference between the two languages in a nutshell.

The process for replacing the two (string) “Object” columns with categories is similar to the one we used in R. The key is understanding how to reference columns in Python. Two possibilities:

  • square bracket notation: bank['Gender']

  • dot notation: bank.Gender

Of these two, square bracket notation is slightly more flexible because it permits column names with spaces, e.g., dataframe['column name']. The dot notation version would fail because Python has no way of knowing what the space after “column” means: dataframe.column name.
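
Both notations refer to the same column, so for the bank data the following two lines are interchangeable:

bank['Gender'].head()   # square bracket notation
bank.Gender.head()      # dot notation, same result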

Once we know how to reference a column (or a “Series” in Pandas-speak), we can run the type conversion method and specify “category” as the output data type:

bank['Gender'].astype('category')
0        Male
1      Female
2      Female
3      Female
4        Male
        ...  
203      Male
204      Male
205      Male
206      Male
207    Female
Name: Gender, Length: 208, dtype: category
Categories (2, object): [Female, Male]

Note at the bottom of the summary that the Gender category consists of two values: Female and Male.

Of course, running the command above just lists the converted values; it does not do anything to the “bank” data frame. To replace the existing column in the data frame, we use the assignment operator:

bank['Gender'] = bank['Gender'].astype('category')
bank['PCJob'] = bank['PCJob'].astype('category')

We can confirm the result by running the describe() method again, but this time using an argument to ensure we get only the categorical columns:

bank.describe(include='category')
        Gender PCJob
count      208   208
unique       2     2
top     Female    No
freq       140   189

This output tells us that we have 208 rows of data. Both the “Gender” and “PCJob” columns have two unique categorical values, and the most frequent values are “Female” and “No” respectively: 140 of the 208 employees are female, while 189 of 208 do not have PC jobs.
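
For a fuller frequency breakdown of a single categorical column, the value_counts() method is handy:

bank['Gender'].value_counts()   # Female: 140, Male: 68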

1.6. Type conversion using programming constructs

As an aside, it is possible to do the same type conversions en masse in Python using core programming constructs such as lists, iterators, and looping. Specifically, for a larger data set with many columns, it might make sense to do the following:

  1. make a list of the columns to convert

  2. write the conversion assignment from above in terms of a loop variable, col

  3. loop through the list, repeating the assignment for each value of col until all the columns are converted

Iterating over a list is not really worth it in this case, since it is two lines of code either way:

for col in ['Gender', 'PCJob']:
    bankxl[col] = bankxl[col].astype('category')

bankxl.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 208 entries, 0 to 207
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   Employee  208 non-null    int64   
 1   EducLev   208 non-null    int64   
 2   JobGrade  208 non-null    int64   
 3   YrHired   208 non-null    int64   
 4   YrBorn    208 non-null    int64   
 5   Gender    208 non-null    category
 6   YrsPrior  208 non-null    int64   
 7   PCJob     208 non-null    category
 8   Salary    208 non-null    float64 
dtypes: category(2), float64(1), int64(6)
memory usage: 12.1 KB
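
Finally, for a data frame with many string columns, the list of columns to convert could itself be built programmatically rather than typed out. A minimal sketch, assuming every “object” column really should become a category (which is not always true):

# select_dtypes picks out just the columns of the given type
for col in bankxl.select_dtypes(include='object').columns:
    bankxl[col] = bankxl[col].astype('category')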