pandas: An Ultimate Python Library for Data Science
In this article, I will introduce the pandas library of Python programming language for data science. We will also see practical examples of code to create data frames, logical operations, and looping, apart from examples of code for the advanced concepts of pandas.
Introduction to pandas
pandas is a great library of Python for data science for most industry applications with massive amounts of different types of data. In this tutorial, we will discuss the use of pandas, including the advanced concepts of the pandas library for data science.
We generally have a massive amount of data. And to handle it, we have already explored NumPy for data science. But is NumPy capable of handling the massive amount of data? The answer is yes but with limitations.
Numpy can handle data if the data is of the same type, i.e., numerical. However, having data of the same kind is challenging in the real world. So, at that time, pandas are needed. Another limitation is that NumPy performs better when the number of rows is 50,000 or less.
NOTE: pandas is built on top of NumPy.
pandas store data in the ‘n’ form of data frames. We can even import CSV files using pandas. Let’s explore more about data frames and see how to store data in the DataFrame.
Creation of Data Frame Out of the Dictionary
As shown in the code below, I have a dictionary, and we have created a data frame using pandas.
import pandas as ps
countryData = {
'country': ['india', 'australia'],
'population': [500, 600]
}
ps.DataFrame(data=countryData)
# output
country population
0 india 500
1 australia 600
To change the index of the above data frame:
data = ps.DataFrame(data=countryData)
data.index = ['IN', 'AUS'] print(data)
# output
country population
IN india 500
AUS australia 600
To print the selected column of the data frame:
data = ps.DataFrame(data=countryData)
data.index = ['IN', 'AUS']
data['country'] # It will output the pandas series
data[['country']] # It will output pandas dataFrame
data[['country', 'population']] # It is used to select more than 1 column
data[0:2] # It will output first 2 observations (rows)
To select the row in DataFrame, we use loc or iloc. The difference between the two is that if we use iloc, we need to supply an integer index, but with loc, we can use the string index only. Look at the example shown below:
# Syntax data.loc[rows, columns]
data = ps.DataFrame(data=countryData)
data.index = ['IN', 'AUS']
data.iloc[1] # will return 2nd row
data.iloc['IN'] # error
data.loc[1] # error
data.loc['IN'] # return 1st row
data.loc[['IN','AUS']] # similar to column selection
NOTE: pandas recommend using “loc” or “iloc” to select data from the DataFrame instead of using [] to select the data. For example, data[‘IN’]
To import a CSV file as a data frame, we use “read_csv()” function provided by pandas:
import pandas as pds
population = pds.read_csv('./population.csv', index_col=0)
NOTE: index_col is set to 0, to tell read_csv function that 1st row in CSV file is header.
We will be dealing with large datasets most of the time, and reading all the data in one go is not a good practice. Doing so will have a drastic performance impact. Instead, with pandas, we can read data in chunks using the “chunksize” property as demonstrated below:
import pandas as pd
sub_dataset = pd.read_csv('test.csv', chunksize=10)
# on demand data
print(next(sub_dataset))
Logical Operations on pandas
Like we can perform a logical operation on a NumPy array, in a very similar way, we can perform logical operations on a panda’s data frame as well:
import pandas as ps
countryData = {
'country': ['India', 'China', 'France'],
'population': [560, 880, 990],
'averageRate': [15, 37, 18]
}
dataSet = ps.DataFrame(data=countryData)
Suppose I want to filter out only results with the “averageRate” above 16.
The first step is to select the column “averageRate,” and the next step is to put the condition:
dataSet['averageRate'] > 16
The next step is to find out the data frame for it:
dataSet[dataSet['averageRate'] > 16]
# output:
averageRate country population
1 37 China 880
2 18 France 990
We can also use logical and/or conditions in pandas using NumPy. Suppose I want results that have “averageRate” above 16 and less than 30:
import pandas as ps
import numpy as np
countryData = {
'country': ['India', 'China', 'France'],
'population': [560, 880, 990],
'averageRate': [15, 37, 18]
}
dataSet = ps.DataFrame(data=countryData)
dataSet[np.logical_and(dataSet['averageRate'] > 16, dataSet['averageRate'] < 30)]
# output
averageRate country population
2 18 France 990
Looping over pandas DataFrame
Iteration over a pandas DataFrame is typically done with the “iterrows()” method. Consider the example shown below:
import pandas as ps
countryData = {
'country': ['India', 'China', 'France'],
'population': [560, 880, 990],
'averageRate': [15, 37, 18]
}
dataSet = ps.DataFrame(data=countryData)
for index, row in dataSet.iterrows():
print(index)
print(row)
If we want to get data of the column:
for index, row in dataSet.iterrows() :
print(str(index) +': '+ str(row['country']))
# output:
0: India
1: China
2: France
What if we want to add a new column to the above DataSet we created? Yes, we can do that as well in pandas:
for index, row in dataSet.iterrows() :
dataSet.loc[index, "new_averageRate"] = row["averageRate"] * 1.5
# output:
averageRate country population new_averageRate
0 15 India 560 22.5
1 37 China 880 55.5
2 18 France 990 27.0
In the above example, you can see that we are using “loc” to add a new column to the row. And we are creating a new calculation of 1.5 times the average rate.
Advanced Concept of pandas for Data Science
Pandas is one of the most effective libraries used to perform operations on the dataset. Earlier, I covered the basics of the pandas package. Now, I will cover some advanced concepts that pandas offer.
I have taken the TMDB 5000 movies dataset from Kaggle to demonstrate examples.
Please note that the links to the HTML file for the code and the actual code file of Jupyter Notebook for practice are at the bottom of this article.
Before we understand the concepts mentioned above, let’s see how the NumPy library can be used with Python. Below is an example to showcase how pandas can interoperate with NumPy.
import pandas as pd
tmdbDataSet = pd.read_csv('tmdb_5000_movies.csv')
print(tmdbDataSet.values)
print(type(tmdbDataSet.values)) # numpy.ndarray
In this article, we will learn more about how pandas can interoperate with NumPy. Let’s start with understanding how to create a dataset from the list.
Creating a Dataset from Lists
We can generate a pandas’ dataset from the lists as well. Let’s look at an example of creating a dataset from the lists.
import pandas as pd
index = ['Name', 'Salary', 'Age'] details = [['Ashok', 'Mike', 'Arun'], [1200, 1400, 2500], [23, 28, 30]]
zippedList = list(zip(index, details))
dictObject = dict(zippedList)
df = pd.DataFrame(dictObject)
print(df)
We can also update the labels of the data columns. To update the labels, refer to the code shown below:
# Let's rename the column of the dataset which we have generated from the list
df = pd.DataFrame(dictObject)
list_labels = ['Age(In Years)', 'Name', 'Salary (In $)']
df.columns = list_labels print(df)
We can also change the headers of the column at the time of creating datasets using the names argument.
import pandas as pd
tmdbDataSet_sub = pd.read_csv('tmdb_5000_movies.csv', header=0, names=['budget', 'genres', 'home_page', 'id', 'keywords', 'original_language', 'original_title', 'overview', 'popularity', 'production_companies', 'production_countries', 'release_date', 'revenue', 'runtime', 'spoken_languages', 'movie_status', 'tagline', 'title', 'vote_average', 'vote_count'], comment='#')
print(tmdbDataSet_sub.head())
Note: We can also pass the header argument along with the names argument to rename the column labels.
Comment argument will ignore the line in the file having ‘#’ symbol in this case.
After reading the file, we can export the file to CSV or excel. Also, we can specify if we need indexes to be exported to the file.
tmdbDataSet.to_csv('tmdb_movies.csv', index=False)
tmdbDataSet.to_excel('tmdb_movies.xlsx', index=False)
Using the plot function of pandas
Pandas provide data visualization by interoperating with the Matplotlib library. We can plot all the columns of the dataset using the plot function. Let’s look at the example shown below:
import pandas as pd
import matplotlib.pyplot as plt
tmdbDataSet = pd.read_csv('tmdb_5000_movies.csv') tmdbDataSet.plot()
plt.show()
We can plot all columns to the same graph using the above code. However, the output looks a bit clumsy, and it isn’t easy to understand if the columns we plotted have different measure values. So, let’s look at an alternative approach to plot columns in the graph:
import pandas as pd
import matplotlib.pyplot as plt
tmdbDataSet = pd.read_csv('tmdb_5000_movies.csv') tmdbDataSet.plot(subplots=True)
plt.show()
You would have noticed we passed the “subplots” argument to the plot function this time. This will plot all the columns in the different graphs, making it easy to understand.
We can also plot selected columns from the dataset.
import pandas as pd
import matplotlib.pyplot as plt
tmdbDataSet = pd.read_csv('tmdb_5000_movies.csv') columnList = ['vote_count','budget'] tmdbDataSet[columnList].plot(subplots=True)
plt.show()
From the output of the above graph, one thing to observe is how the pandas’ plot function puts the index value of data frames to the x-axis. Now, we can see how to manually set any value to the x-axis.
import pandas as pd
import matplotlib.pyplot as plt
tmdbDataSet = pd.read_csv('tmdb_5000_movies.csv') tmdbDataSet.plot(x='production_companies', y=['budget'])
plt.show()
We can even generate a box plot or scatter graph using the pandas’ plot function.
import pandas as pd
import matplotlib.pyplot as plt
tmdbDataSet = pd.read_csv('tmdb_5000_movies.csv')
cols = ['vote_count','budget']
tmdbDataSet[cols].plot(kind='box', subplots=True)
plt.show()
tmdbDataSet.plot(kind='scatter', x='budget', y='vote_count', s=tmdbDataSet.popularity)
plt.show()
The scatter plot accepts arguments for the size of each circle in the plot.
We can also plot a histogram plot using pandas. The histogram plot depicts the PDF (Probability distribution function). Its output shows us the probability of the value we are plotting on the x-axis.
import pandas as pd
import matplotlib.pyplot as plt
tmdbDataSet = pd.read_csv('tmdb_5000_movies.csv') tmdbDataSet.plot(kind='hist', y='vote_average')
plt.show()
We can pass an additional cumulative parameter to tell the histogram whether to plot PDF or CDF. (Cumulative Distribution Function)
The output of the CDF plot depicts the probability of having a value plotted at the x-axis.
import pandas as pd
import matplotlib.pyplot as plt
tmdbDataSet = pd.read_csv('tmdb_5000_movies.csv') tmdbDataSet.plot(kind='hist', y='vote_average', normed=True)
plt.show()
# Observations
# The output depicts probability of having different vote_averages
We can plot charts in different rows and columns using the following code:
figure, axes = plt.subplots(nrows=2, ncols=1)
tmdbDataSet.plot(ax=axes[0], kind='hist', y='vote_average') tmdbDataSet.plot(ax=axes[1], kind='hist', y='vote_average', normed=True, cumulative=True)
plt.show()
# Observations
# Here 1st curve plotted is PDF as we have seen earlier. # And 2nd second plot plotted is CDF.
Here, we have declared axes, and then we can use the ax argument to plot various graphs as one.
pandas with time series
If we have a date column in our dataset and want to create it as an index, pandas provide index_col and parse_dates parameters.
import pandas as pd
tmdbDataSet = pd.read_csv('tmdb_5000_movies.csv', index_col='release_date', parse_dates=True) print(tmdbDataSet.head())
print(tmdbDataSet.loc['2010-Aug-01'])
print(tmdbDataSet.loc['2010-01-01 21:00:00':'2010-05-11 22:00:00'])
The advantage of parsing the date column and making it an index is filtering results based on the date. As an example, we are filtering results by 1st August 2010.
Also, we can leverage the loc function to check rows between the date range.
We can also format the date column in the list. pandas provide to_datetime function to do so.
import pandas as pd
print(pd.to_datetime(['2010-01-01 21:00:00','2010-05-11 22:00:00'], format='%Y-%m-%d %H:%M'))
Plotting Charts with Time Series Column
If we have an index as a date column on the calling plot function, pandas will automatically adjust the plotted plots.
import pandas as pd
tmdbDataSet_date_index = pd.read_csv('tmdb_5000_movies.csv', index_col='release_date',
parse_dates=True)
tmdbDataSet_date_index.vote_count.plot()
plt.show()
Resampling of Data in Pandas
We can also resample our dataset in pandas. Resampling means using statical methods to calculate results. There are two types of sampling:
Downsampling: Reduce rows in our dataset based on conditions. E.g., Consider you have a data set that records humidity every day. Then, you can downsample the dataset, which records humidity every month. Here, you are reducing the number of rows in the dataset.
import pandas as pd
tmdbDataSet_dateIndex = pd.read_csv('./Cleaning Data/tmdb_5000_movies.csv',
parse_dates=True, index_col='release_date') print(tmdbDataSet_date_index.resample('A').sum()) # Down sampling data yearly print(tmdbDataSet_date_index.resample('A').sum().count())
# Down sampling data yearly with chaining
Upsampling: Increase rows in our dataset based on condition. Again, consider a data set that records humidity every day. You can up the sample dataset to every hour. Here, you are increasing the number of rows in the dataset.
NOTE: It is always advisable to use sampling with some statistical methods like mean, count, sum, etc.
import pandas as pd
tmdbDataSet_dateIndex = pd.read_csv('./Cleaning Data/tmdb_5000_movies.csv',
parse_dates=True, index_col='release_date')
print(tmdbDataSet_date_index.resample('H').sum()) # up sampling data hourly print(tmdbDataSet_date_index.resample('4H').sum()) # up sampling data every 4 hours
Chaining and filtering
While resampling/filtering data, we can chain methods and get the desired results with pandas. Consider the example shown below:
import pandas as pd
tmdbDataSet_dateIndex = pd.read_csv('./Cleaning Data/tmdb_5000_movies.csv',
parse_dates=True, index_col='release_date')
# Down sampling data yearly with chaining print(tmdbDataSet_date_index.resample('A').sum().count())
Grouping of data
pandas provide a “group by” function, using which we can also group our dataset. We can do several operations with a group by function, like electing columns, grouping by multiple columns, performing multiple aggregations, and a lot more. Consider the example shown below:
import pandas as pd
tmdbDataSet = pd.read_csv('tmdb_5000_movies.csv')
# Count of dataset on each day print(tmdbDataSet.groupby('release_date').count())
# Sum of budget made on each day by 5000 movies print(tmdbDataSet.groupby('release_date')['budget'].sum())
# Grouping by the mutiple columns and selecting certain columns print(tmdbDataSet.groupby(['release_date', 'runtime'])[['popularity', 'budget']].sum())
# Multiple aggregations
print(tmdbDataSet.groupby(['release_date', 'runtime'])[['popularity', 'budget']] .agg(['sum', 'count']))
Conclusion
In this post, we have seen, with the help of examples, and saw how powerful the library is. This library has tremendous power to read huge datasets and perform data processing before we can pass the data to machine learning models.
Moreover, you may also like to know about data reindexing, transforming, and aggregating in pandas.
You can look at the code discussed in this article in this HTML file.
Also, for your practice, I am sharing the link to jupyter notebook link to jupyter notebook, where you can refer to and follow along with the post. This will help you get hands-on with the concepts explained.
Tavish lives in Hyderabad, India, and works as a result-oriented data scientist specializing in improving the major key performance business indicators.
He understands how data can be used for business excellence and is a focused learner who enjoys sharing knowledge.