4 Prepare Yourself for data
This chapter is the Python version of the last chapter. We learn how to write and run Python code in the Posit Cloud RStudio IDE.
4.1 Python in RStudio
Python Script
A Python script is a .py file that contains Python code. To create a Python script, go to File > New > Python Script, or click the green-plus icon in the top-left corner and select Python Script. Here I print the string "Hello, World!", create a string object b storing "Hello, World!", and then print the 3rd to 4th letters of the string.
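A minimal sketch of a script that carries out the three steps just described could look like this. The exact contents of the original script are an assumption; only the steps themselves come from the text.

```python
# Print a string directly
print("Hello, World!")

# Store the string in an object named b
b = "Hello, World!"

# Print the 3rd to 4th letters; Python counts from 0 and excludes the end index
third_to_fourth = b[2:4]
print(third_to_fourth)  # ll
```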
Run Python Code
- Running Python code may require you to update some packages. Please say YES!
- When you run the Python code in the script, the console will switch from R to Python.
- Type quit in the Python console to switch back to the R console.
To type and run Python code directly in the console, we install the R package reticulate in RStudio (see R package section below for more discussion). Once reticulate is installed, we call library(reticulate) to load it into memory for direct use in our computing environment. Then the function call repl_python() turns the R console into a Python console. The word “REPL” stands for Read, Evaluate, Print, Loop. We then type Python code after the prompt >>> and run it by hitting Enter/Return.
To return to the R console, just type quit or exit in the Python console, and hit Enter/Return.
After we run the Python script from Figure 4.1, the following object is stored in the environment:
- Object b storing the string Hello, World!
4.2 Basic Python
In this section, we learn some basic Python syntax by translating the R code in the previous chapter. While the printed output may look a little different, the R and Python code compute the same results.
4.2.1 Python Packages 📦
As with R, there are many Python packages/libraries out there. The most popular Python packages, grouped by purpose, include:
Data visualization: matplotlib, seaborn
Machine Learning: scikit-learn
Statistics: statsmodels
Without those packages, it is very hard, if not impossible, to do statistical and data analysis in Python. Again, Python is a general-purpose language, while R is built for statistics. R has many built-in functionalities specifically for statistics and data management. But with those packages, Python can also do fancy statistics and data science.
To install a Python package in Posit Cloud, run the following in your RStudio project:
library(reticulate)
virtualenv_create("myenv")
Then go to Tools > Global Options > Python > Select > Virtual Environments.
You may need to restart the R session. If so, do it, and in the new R session, run the following to install the NumPy, pandas, and matplotlib packages.
library(reticulate)
py_install(c("numpy", "pandas", "matplotlib"))
Run the following Python code, and make sure everything goes well.
import numpy as np
import pandas as pd
v1 = np.array([3, 8])
v1
df = pd.DataFrame({"col": ['red', 'blue', 'green']})
df
4.2.2 Operators
Python and R have very similar arithmetic and logical operator syntax.
## Arithmetic Example
2 + 3 * 5 + 4
# 21
2 + 3 * (5 + 4)
# 29
## Logical Example
5 <= 5
# True
5 <= 4
# False
5 != 5 # Is 5 NOT equal to 5?
# False
To negate the keywords True or False in Python, we add not in front of them. Also, instead of using & and |, we use and and or to combine comparisons.
## Boolean Operations
True != False # Is TRUE not equal to FALSE?
# True
not True == False # Is not TRUE equal to FALSE?
# True
True or False # TRUE if either one is TRUE or both are TRUE
# True
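As a small illustration of combining comparisons, the sketch below uses and, or, and not on a single value. The variable names are made up for the example. It also shows that NumPy arrays, introduced later in this chapter, use &, |, and ~ for element-wise logic.

```python
import numpy as np

x = 7

# and / or combine comparisons on single values
in_range = x > 5 and x < 10   # True: both comparisons hold
either = x < 0 or x == 7      # True: the second comparison holds
negated = not (x > 5)         # False

# For NumPy arrays, element-wise logic uses &, |, and ~ instead,
# and each comparison needs its own parentheses
v = np.array([3, 8, 4, 5])
mask = (v > 3) & (v < 8)
print(mask)  # [False False  True  True]
```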
To use mathematical functions in Python, we need to import the math module. Then to call any function in this module, we type the module name followed by a dot, and then the function name. This syntax is general for using functions in a Python module or library: libraryname.functionname().
## Built-in Functions
import math
math.sqrt(144)
# 12.0
math.exp(1) # Euler's number
# 2.718281828459045
math.sin(math.pi / 2)
# 1.0
abs(-7)
# 7
math.factorial(5)
# 120
math.log(100) # Natural log with base e
# 4.605170185988092
math.log(100, 10) # Log function with specified base 10
# 2.0
In Python, we use = to do assignment.
## Assignment
x = 5
x
# 5
## Variable Operations
x = x + 6
x
# 11
x == 5
# False
math.log(x)
# 2.3978952727983707
As shown in Basic R, the following code is an example of BAD naming! Never do this!
## Bad Naming (Avoid Doing This in Practice)
math.pi = 20 # This is bad coding, avoid overwriting built-in names
abs
# <built-in function abs>
abs = abs(math.pi)
abs
4.2.3 Object types
To check an object's type, we also use type(). Note that in Python 5 itself is an integer with no decimal places. 5.0 instead is of type float, corresponding to type double in R. We can turn a float into an integer using int(). The character type in R is the type str, short for string, in Python.
## Type Checking
type(5)
# <class 'int'>
type(5.0)
# <class 'float'>
type(int(5.0))
# <class 'int'>
type("I_love_stats!")
# <class 'str'>
The logical type in R is the type bool, short for boolean, in Python.
## Boolean Type
type(1 > 3)
# <class 'bool'>
print(isinstance(5, int))
# True
We can use the isinstance() function to check whether or not the specified object is of the specified type.
isinstance(5, float)
# False
isinstance(5.0, int)
# False
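A few more type checks and conversions, as a sketch that uses only built-in functions. The variable names are made up for illustration.

```python
# isinstance() accepts a tuple of types: True if the object matches any of them
is_number = isinstance(5.0, (int, float))   # True

# bool is a subclass of int in Python, so True also counts as an int
bool_is_int = isinstance(True, int)         # True

# int() truncates toward zero rather than rounding
truncated = int(5.9)                        # 5

# Conversions between the basic types
as_float = float(5)       # 5.0
as_str = str(5)           # '5'
as_int = int("42")        # 42
```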
4.2.4 Python data structures
Python has built-in data structures including lists, tuples, dictionaries, and sets. However, they are not designed specifically for statistics and data science. We usually use arrays from the NumPy library and DataFrames from pandas for statistical analysis. There is no exact one-to-one correspondence between R and Python data structures. Below I show one Python counterpart for each R data structure; other Python structures could represent the same thing.
Vector (One-dimensional Array of NumPy)
Python has numbers and strings, but no built-in vector structure. To create a sequence type of structure, we can use a list, which can store several elements in a single object. To create a list in Python, we use []. For details about Python lists, please check Appendix B.
Here we use the one-dimensional array structure in NumPy to represent a vector. First we import NumPy and give it the shorter name np.
import numpy as np
Then to create a one-dimensional array, we call the function np.array() with a list of objects inside.
## Vector Creation
dbl_vec = np.array([1, 2.5, 4.5])
dbl_vec
# array([1. , 2.5, 4.5])
int_vec = np.array([1, 6, 10])
int_vec
# array([ 1, 6, 10])
log_vec = np.array([True, False, False])
log_vec
# array([ True, False, False])
chr_vec = np.array(["pretty", "girl"])
chr_vec
# array(['pretty', 'girl'], dtype='<U6')
The function len() is used to check the number of elements in the array, and the attribute .dtype, short for data type, gives the element type of a NumPy array (a dtype argument can also be used to specify the type when an array is created). Note that a NumPy array stores all of its elements with one common type, so dbl_vec.dtype describes every element of the array. Here it is a 64-bit floating-point number, where 64 bits is the size of each element in memory. For the string array chr_vec, <U6 means Unicode strings of at most 6 characters.
## Vector Properties
len(dbl_vec) # Length of the vector
# 3
# Type of elements in the vector
dbl_vec.dtype # dtype('float64')
chr_vec[0].dtype
# dtype('<U6')
chr_vec[1].dtype
# dtype('<U4')
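The following sketch illustrates how an array settles on one element type, and how that type can be specified or converted. The variable names are made up for the example; dtype= and .astype() are standard NumPy features.

```python
import numpy as np

# Mixing integers and floats: every element is stored as a float
mixed_num = np.array([1, 2.5, 4])
print(mixed_num.dtype)   # float64

# The dtype argument sets the type when the array is created
as_float = np.array([1, 6, 10], dtype=float)
print(as_float)          # [ 1.  6. 10.]

# astype() returns a converted copy; the original array is unchanged
int_vec = np.array([1, 6, 10])
converted = int_vec.astype(float)
print(int_vec.dtype.kind, converted.dtype.kind)   # i f
```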
Operations on Vectors (1D Array)
As in R, NumPy array operations are element-wise.
## Vector Arithmetic
v1 = np.array([3, 8])
v2 = np.array([4, 100])
# %%
# Vector addition
v1 + v2
# array([ 7, 108])
# Vector subtraction
v1 - v2
# array([ -1, -92])
# Vector multiplication and division
v1 * v2
# array([ 12, 800])
v1 / v2
# array([0.75, 0.08])
np.sqrt(v2)
# array([ 2., 10.])
Recycling of Vectors (1D Array)
Unlike R, NumPy arrays do not recycle a shorter vector to match a longer one. The exception shown here is an operation with a scalar, which is applied to every element.
## Recycling in Vector Arithmetic
v1 = np.array([3, 8, 4, 5])
v1 * 2 # Element-wise multiplication
# array([ 6, 16, 8, 10])
v1 * np.array([2, 2, 2, 2]) # Equivalent to above
# array([ 6, 16, 8, 10])
v3 = np.array([4, 11])
# Use np.resize to automatically resize v3 to match v1's length
v1.shape
# (4,)
v3_resized = np.resize(v3, v1.shape)
v3_resized
# array([ 4, 11, 4, 11])
v1 + v3_resized
# array([ 7, 19, 8, 16])
If we do v1 + v3, Python raises the error below, saying that the two vectors are not of the same size.
# Traceback (most recent call last):
# File "<string>", line 1, in <module>
# ValueError: operands could not be broadcast together with shapes (4,) (2,)
So in Python, we need to do recycling manually. We can first resize the vector v3 (with np.resize()) so that it has the same size (.shape) as v1.
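The manual recycling step can be wrapped in a small helper. This is only a sketch: recycle() is a hypothetical function name, not part of NumPy, and it simply calls np.resize() as described above.

```python
import numpy as np

def recycle(short, target):
    """Repeat `short` until it has the same shape as `target` (hypothetical helper)."""
    return np.resize(short, target.shape)

v1 = np.array([3, 8, 4, 5])
v3 = np.array([4, 11])

total = v1 + recycle(v3, v1)
print(total)   # [ 7 19  8 16]

# np.resize also handles lengths that are not a multiple of the shorter vector
v5 = np.array([1, 2, 3, 4, 5])
print(recycle(v3, v5))   # [ 4 11  4 11  4]
```

Unlike R, this never happens silently, so a length mismatch is always a visible decision in the code.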
Subsetting Vectors (1D Array)
Always keep in mind that indexing in Python starts at 0! So we grab the first element with the index [0], and in general the kth element with [k-1]. If we want to keep multiple elements, we can pass a Python list of indices inside the square brackets. So v1[[0, 2]] keeps the first and third elements of v1. In Python, we can use the np.delete() function to remove elements.
With a NumPy array, we cannot use v1[[-1, -2]] or v1[-[1, 2]] to remove elements the way negative indices do in R.
v1[[-1, -2]] actually returns the last element of v1 followed by the second-to-last. Negative indexing works in Python, but it means counting from the end.
v1[[-1, -2]]
# array([5, 4])
v1[-[1, 2]] instead raises an error. There is no indexing rule that puts a negative sign in front of a list.
## Subsetting
v1
# array([3, 8, 4, 5])
v2
# array([ 4, 100])
v1[0] # First element
# 3
v2[1] # Second element
# 100
# %%
v1[[0, 2]] # Corresponds to v1[c(1, 3)] in R
# array([3, 4])
np.delete(v1, [1, 2]) # Corresponds to v1[-c(2, 3)] in R
# array([3, 5])
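Beyond integer lists, NumPy arrays can also be subset with conditions and slices. The sketch below redefines v1 so that it runs on its own; the other variable names are made up for illustration.

```python
import numpy as np

v1 = np.array([3, 8, 4, 5])

# Keep elements that satisfy a condition, like v1[v1 > 3] in R
big = v1[v1 > 3]          # array([8, 4, 5])

# A slice keeps a contiguous range: start is included, stop is excluded
middle = v1[1:3]          # array([8, 4])

# A single negative index counts from the end
last = v1[-1]             # 5

# Drop elements with a Boolean mask instead of np.delete
keep = np.ones(len(v1), dtype=bool)
keep[[1, 2]] = False
dropped = v1[keep]        # array([3, 5])
```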
Factor (pd.Categorical())
There is no built-in factor data structure in Python. A similar structure is the Categorical vector in the pandas package. We first import the package into our working session, and call it pd.
To create a pandas categorical vector, we use pd.Categorical(), and inside the call, we provide a list-like object.
# Factor equivalent in Python using pandas
fac = pd.Categorical(["med", "high", "low"])
fac
# ['med', 'high', 'low']
# Categories (3, object): ['high', 'low', 'med']
type(fac)
# <class 'pandas.core.arrays.categorical.Categorical'>
fac.categories shows the categories or levels, and fac.codes shows how those levels are coded as numbers. As the output below shows, the categories are sorted alphabetically by default, and the codes start at 0. Therefore we have high = 0, low = 1, and med = 2.
fac.categories
# Index(['high', 'low', 'med'], dtype='object')
fac.codes
# array([2, 0, 1], dtype=int8)
We can create an ordered categorical vector by adding ordered=True. The order will follow the specification in the argument categories. Notice that now we have ['low' < 'med' < 'high'].
order_fac = pd.Categorical(["med", "high", "low"], categories=["low", "med", "high"], ordered=True)
order_fac
# ['med', 'high', 'low']
# Categories (3, object): ['low' < 'med' < 'high']
order_fac.codes
# array([1, 2, 0], dtype=int8)
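One practical payoff of an ordered categorical is that comparisons and sorting respect the declared order. The sketch below rebuilds order_fac so that it runs on its own; below_high is a name made up for the example.

```python
import pandas as pd

order_fac = pd.Categorical(
    ["med", "high", "low"],
    categories=["low", "med", "high"],
    ordered=True,
)

# Comparisons follow the category order, not the alphabet
below_high = order_fac < "high"
print(below_high)                # [ True False  True]

# min() and max() are defined for ordered categoricals
print(order_fac.min(), order_fac.max())   # low high

# Sorting also follows the category order
print(list(order_fac.sort_values()))      # ['low', 'med', 'high']
```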
List
Python has its own built-in list structure. Unlike R lists, Python lists cannot have named elements. To create a Python built-in list, we use []. Check Appendix B for more details.
# Creating and accessing lists
x_lst = [[1, 2, 3], "a", [True, False]]
x_lst
# [[1, 2, 3], 'a', [True, False]]
type(x_lst)
# <class 'list'>
len(x_lst)
# 3
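When named elements are needed, a Python dictionary is the usual substitute for an R named list. The sketch below is an illustration with made-up keys, not a structure used elsewhere in this chapter.

```python
# A dictionary maps names (keys) to values, much like a named list in R
x_dict = {"nums": [1, 2, 3], "letter": "a", "flags": [True, False]}

# Access an element by its name rather than its position
print(x_dict["letter"])        # a
print(list(x_dict.keys()))     # ['nums', 'letter', 'flags']
print(len(x_dict))             # 3
```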
Extracting a single element of a Python list is straightforward. Just put the index in the square bracket.
# Subsetting list elements
x_lst[0]
# [1, 2, 3]
type(x_lst[0])
# <class 'list'>
If we would like to extract multiple elements from a Python list, we use the slice operator, which is written with colons. A slice is typically given a starting index and an ending index. The starting index is inclusive, and the ending index is exclusive. For example, lst[2:4] grabs the third and fourth elements of a list lst.
By default, a slice uses a sequence of indices with increment 1. We can add one more colon followed by the step between indices. For example, 2:8:2 creates the sequence of indices (2, 4, 6). Note that 8 is excluded. The following are some examples.
x_lst[0:2:1]
# [[1, 2, 3], 'a']
x_lst[0:2]
# [[1, 2, 3], 'a']
x_lst[0:3:2]
# [[1, 2, 3], [True, False]]
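The slice rules are easier to see on a longer list. In this sketch, lst is a made-up list whose values equal their own indices, so each result shows exactly which positions were selected.

```python
lst = list(range(10))   # [0, 1, 2, ..., 9]

print(lst[2:4])       # [2, 3]     start included, stop excluded
print(lst[2:8:2])     # [2, 4, 6]  step of 2, 8 excluded
print(lst[:3])        # [0, 1, 2]  omitted start means "from the beginning"
print(lst[-2:])       # [8, 9]     negative start counts from the end
print(lst[::-1][:3])  # [9, 8, 7]  step of -1 reverses the list
```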
Matrix (2-dimensional numpy array)
The matrix structure is in fact a 2-dimensional array, which can be created with the numpy package.
To create a 2D array, we pass a list of lists to np.array(). The first inner list is the first row of the resulting matrix, the second inner list is the second row, and so on. By default, Python fills in elements row by row.
mat = np.array([[1, 4], [2, 5], [3, 6]])
mat
# array([[1, 4],
# [2, 5],
# [3, 6]])
We could also create a list of numbers, then use the .reshape() method to decide the dimension of the matrix and how the numbers in the list fill the matrix. The argument order = "F" means we’d like to fill elements by columns.
mat = np.array([1, 2, 3, 4, 5, 6]).reshape((3, 2), order = "F")
mat
# array([[1, 4],
# [2, 5],
# [3, 6]])
# Dimension
mat.shape # (3, 2)
mat.shape[0] # Number of rows
# 3
mat.shape[1] # Number of columns
# 2
Subsetting a Matrix
Subsetting a matrix in Python is similar to that in R. We have two sets of indices, for rows and columns respectively, separated by a comma. However, in Python, if we keep all rows or all columns, we need to put a colon : in the row index or column index. Also, remember that indexing in Python starts at 0.
mat[:, 1] # Second column
# array([4, 5, 6])
mat[1, :] # Second row and all columns
# array([2, 5])
mat[[0, 2], :] # First and third rows
# array([[1, 4],
# [3, 6]])
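A few more subsetting patterns, sketched on the same matrix. mat is rebuilt here so the block runs on its own, and the other names are made up for illustration.

```python
import numpy as np

mat = np.array([[1, 4], [2, 5], [3, 6]])

# A single element: first row, second column
elem = mat[0, 1]          # 4

# An integer column index drops a dimension; a list index keeps the result 2D
col_1d = mat[:, 1]        # shape (3,)
col_2d = mat[:, [1]]      # shape (3, 1)

# A Boolean condition returns the matching elements as a 1D array
big = mat[mat > 3]        # array([4, 5, 6])
```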
Stacking Matrices
To combine two matrices, we use np.hstack() and np.vstack(). np.hstack() is similar to cbind() in R: it stacks arrays in sequence horizontally (column-wise). np.vstack(), in contrast, is similar to rbind() in R: it stacks arrays in sequence vertically (row-wise).
# Column binding (cbind in R)
print(mat)
# [[1 4]
# [2 5]
# [3 6]]
mat_c = np.array([7, 0, 0, 8, 2, 6]).reshape((3, 2), order='F')
print(np.hstack((mat, mat_c))) # Should have the same number of rows
# [[1 4 7 8]
# [2 5 0 2]
# [3 6 0 6]]
# Row binding (rbind in R)
print(mat)
# [[1 4]
# [2 5]
# [3 6]]
mat_r = np.array([1, 2, 3, 4]).reshape((2, 2), order='F')
print(np.vstack((mat, mat_r))) # Should have the same number of columns
# [[1 4]
# [2 5]
# [3 6]
# [1 3]
# [2 4]]
Data Frame
Python has no built-in data frame structure, and the numpy package does not supply one either. A data frame can be created with the pandas package using pd.DataFrame() once we import pandas as pd.
import pandas as pd
Inside pd.DataFrame(), we need to provide a sequence of named objects, where the names are the column names or variable names of the data frame. To do so, we can use a Python built-in dictionary, which is created with {} and has the key-value structure key:value. The keys in the dictionary become the column names of the resulting data frame. For more details about Python dictionaries, please check Appendix B.
# Creating a DataFrame
df = pd.DataFrame({"age": [19, 21, 40], "gender": ["m", "f", "m"]})
df
# age gender
# 0 19 m
# 1 21 f
# 2 40 m
We can check a summary of the data frame using the .info() method.
# DataFrame structure
df.info()
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 3 entries, 0 to 2
# Data columns (total 2 columns):
# # Column Non-Null Count Dtype
# --- ------ -------------- -----
# 0 age 3 non-null int64
# 1 gender 3 non-null object
# dtypes: int64(1), object(1)
# memory usage: 180.0+ bytes
Properties of Data Frames
# Accessing DataFrame properties
# Names of columns
df.columns # Index(['age', 'gender'], dtype='object')
len(df) # Number of observations
# 3
df.shape[1] # Number of columns
# 2
# Dimensions
df.shape # (3, 2)
type(df) # Type of df
# <class 'pandas.core.frame.DataFrame'>
df.__class__.__name__ # Class of df
# 'DataFrame'
To combine two data frames, we can use the pd.concat() function. Notice the difference between the cases with and without ignore_index=True. If True, the resulting axis will be labeled \(0, \ldots, n - 1\). This is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information.
# Row binding with DataFrames
df_r = pd.DataFrame({"age": [10], "gender": ["f"]})
pd.concat([df, df_r], ignore_index=True)
# age gender
# 0 19 m
# 1 21 f
# 2 40 m
# 3 10 f
pd.concat([df, df_r])
# age gender
# 0 19 m
# 1 21 f
# 2 40 m
# 0 10 f
By default, pd.concat() combines two data frames by rows (axis=0). If we would like to combine data frames by columns, we add axis=1 in the function call.
# Column binding with DataFrames
df_c = pd.DataFrame({"col": ["red", "blue", "gray"]})
df_new = pd.concat([df, df_c], axis=1)
df_new
# age gender col
# 0 19 m red
# 1 21 f blue
# 2 40 m gray
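One point worth keeping in mind is that pd.concat() with axis=1 matches rows by their index labels rather than by position. The sketch below illustrates this with a hypothetical one-row data frame, df_late, whose index label is 2.

```python
import pandas as pd

df = pd.DataFrame({"age": [19, 21, 40], "gender": ["m", "f", "m"]})

# A one-row data frame whose index label is 2, not 0
df_late = pd.DataFrame({"col": ["gray"]}, index=[2])

# axis=1 matches rows by index label; unmatched rows get NaN
aligned = pd.concat([df, df_late], axis=1)
print(aligned["col"].isna().tolist())   # [True, True, False]

# reset_index(drop=True) relabels rows 0, 1, ... before combining by position
by_position = pd.concat([df.iloc[[2]].reset_index(drop=True),
                         df_late.reset_index(drop=True)], axis=1)
print(by_position.shape)                # (1, 3)
```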
Subsetting a Data Frame
To access a group of rows and columns of a data frame, we can use .loc[] or .iloc[]. loc is short for location, and the i in iloc stands for integer position. In the examples, we use .iloc[] with the index list [0, 2] to grab the first and third rows. We use .loc[] to get the 'age' column by its name, which can also be obtained with df_new["age"]. We can put a selection condition in the brackets, similar to what we do in R.
# Subsetting rows
df_new.iloc[[0, 2], :] # Subset rows
# age gender col
# 0 19 m red
# 2 40 m gray
df_new.loc[:, 'age']
# 0 19
# 1 21
# 2 40
# Name: age, dtype: int64
df_new["age"]
# 0 19
# 1 21
# 2 40
# Name: age, dtype: int64
df_new[df_new["age"] == 21] # Select row where age == 21
# age gender col
# 1 21 f blue
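The difference between .loc[] and .iloc[] becomes clearer when the row labels are not 0, 1, 2. The sketch below uses a hypothetical data frame df_lab with letter labels; it is an illustration, not one of the chapter's objects.

```python
import pandas as pd

df_lab = pd.DataFrame(
    {"age": [19, 21, 40], "gender": ["m", "f", "m"]},
    index=["a", "b", "c"],   # hypothetical row labels
)

by_label = df_lab.loc["b", "age"]      # 21, selected by row and column names
by_position = df_lab.iloc[1, 0]        # 21, selected by integer positions

# Combine conditions with & / | and wrap each comparison in parentheses
older_men = df_lab[(df_lab["age"] > 18) & (df_lab["gender"] == "m")]
print(older_men.index.tolist())        # ['a', 'c']

# .loc slices include the end label; .iloc slices exclude the end position
print(len(df_lab.loc["a":"b"]), len(df_lab.iloc[0:1]))   # 2 1
```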
Look carefully at the difference between subsetting with one and two brackets. With one bracket, the result is a one-dimensional pandas Series, which is similar to a 1D numpy array and can work as a vector. When two brackets are used for subsetting, the data frame structure is kept. The two outputs are printed differently too.
# Subsetting columns
df_new["age"] # become a pd Series
# 0 19
# 1 21
# 2 40
# Name: age, dtype: int64
type(df_new["age"])
# <class 'pandas.core.series.Series'>
df_new[["age"]] # Still a pd DataFrame
# age
# 0 19
# 1 21
# 2 40
type(df_new[["age"]])
# <class 'pandas.core.frame.DataFrame'>
df_new[["age", "gender"]] # Multiple columns like a matrix
# age gender
# 0 19 m
# 1 21 f
# 2 40 m
df_new.loc[:, ["age", "gender"]] # Equivalent to matrix-like subsetting
# age gender
# 0 19 m
# 1 21 f
# 2 40 m
4.3 Exercises
- Vector
The code above shows a Marquette student's poker and roulette winnings from Monday to Friday. Copy and paste them into your Python session and complete problem 1.
- Assign to the variable total_daily how much you won or lost on each day in total (poker and roulette combined).
- Calculate the overall winnings total_week. Print it out.
# ==============================================================================
## Factor
# ==============================================================================
# Create speed_vector
speed_vec = pd.Categorical(["medium", "low", "low", "medium", "high"])
- Factor
- speed_vec above should be converted to an ordinal factor since its categories have a natural ordering. Create an ordered factor vector speed_fac by completing the code below.
# Create speed_vector
___________ = pd.Categorical(______, categories=___________, ordered=______)
# ==============================================================================
## Data frame
# ==============================================================================
# Defining vectors for planets
name = ["Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune"]
planet_type = ["Terrestrial planet", "Terrestrial planet", "Terrestrial planet",
               "Terrestrial planet", "Gas giant", "Gas giant",
               "Gas giant", "Gas giant"]
diameter = [0.375, 0.947, 1, 0.537, 11.219, 9.349, 4.018, 3.843]
rotation = [57.63, -242.03, 1, 1.05, 0.42, 0.44, -0.73, 0.65]
rings = [False, False, False, False, True, True, True, True]
- Data Frame
Data frames have properties of lists and matrices, so we skip lists and matrices and focus on data frames. You want to construct a data frame that describes the main characteristics of eight planets in our solar system. You feel confident enough to create the necessary vectors: name, planet_type, diameter, rotation and rings, which have already been coded up as above. The first element in each of these vectors corresponds to the first observation.
- Use the function pd.DataFrame() to construct a data frame. Pass the vectors name, planet_type, diameter, rotation and rings as arguments in this order. Call the resulting data frame planets_df.
_________ = pd.___________({____:_____, ____:_____, ____:_____, ____:_____, ____:_____})
- From planets_df, select the diameter of Mercury: this is the value at the first row and the third column. Simply print out the result.
- From planets_df, select all data on Mars (the fourth row). Simply print out the result.
- Select and print out the first 5 values in the diameter column of planets_df.