Date Manipulation in Python for Data Analysis
In this article, we look at advanced date manipulation with Python. From dates, we will extract weeks, months, quarters, and years, along with the combinations of them needed for our data analysis.
After this, we will visualize the data summarized by these periods. We will also see how to add data labels to a bar graph.
First, we import the necessary libraries.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Load the data.
df = pd.read_csv('lightning_strikes_2016.csv', on_bad_lines='skip')
df2 = pd.read_csv('lightning_strikes_2018.csv', on_bad_lines='skip')
The on_bad_lines='skip' argument tells read_csv to skip rows that have more entries than the number of columns instead of raising an error.
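To illustrate what on_bad_lines='skip' does, here is a minimal sketch on made-up CSV text (the rows below are not from the lightning data set). It assumes pandas 1.3 or later, where this parameter is available.

```python
import io

import pandas as pd

# Made-up CSV text: the third data row has one field too many.
raw = io.StringIO(
    "date,number_of_strikes\n"
    "2016-01-04,55\n"
    "2016-01-05,46\n"
    "2016-01-06,31,unexpected\n"
    "2016-01-07,24\n"
)

# on_bad_lines='skip' drops the over-long row instead of raising a parser error.
sample = pd.read_csv(raw, on_bad_lines="skip")
print(sample)
```

Only the three well-formed rows remain in sample. Skipping is silent, so it is worth comparing row counts with the raw file if the number of dropped rows matters.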
Next, I cleaned the data by filtering some rows out. Later in the article, when I used the div() function on the count of strikes, I got an error caused by dirty string entries. So, I identified the inconsistent dates in the data and filtered those out.
arr = ['225.5)', '20)', '2016INT(-82.1 22.3)',
'2016-07-146-07-14', '2012)', '2016-0.3 50.2)', '2016-07-15 18.5)',
'20T(-75.5 41.9)', '203.5)', '2016-07-49.5)', '2016', '2014',
'2016--07-14', '2016-08 45.2)', '2016-07-1T(-87.3 31.9)',
'2016-07-1416-07-14', '2016-07-141', '2016-09 22.9)',
'2016-07-0 50.7)', '2034.6)', '2016-0.9)', '2016-07-5',
'2016-07-7-15', '2016-07-1NT(-92.9 33.4)', '2015',
'2016-07-16-07-15', '2016-07-15 19.6)', '2016-',
'2016-07NT(-79 35.7)', '2016-07-T(-84.8 19.9)', '2016-07- 25.2)',
'2016--98.8 40.4)', '215', '2016016-07-15', '2016(-81.1 21.8)',
'20-07-15', '216-07-15', '2016-07-016-07-15']
df = df[~df['date'].isin(arr)]
df = df[df['number_of_strikes'].apply(lambda x: isinstance(x, int))]
After filtering out those dates, I keep only the rows whose 'number_of_strikes' entry is an integer, which removes the remaining dirty string entries. For this, I use a lambda function with apply() to avoid a 'for' loop. apply() calls the lambda on every entry of the column, and the resulting True/False values are used to filter the rows.
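As an alternative to listing the bad dates by hand, the conversion functions in pandas can flag unparseable entries themselves. The sketch below is not the approach used in this article. It runs on a made-up four-row frame (named toy here for illustration) and assumes the valid dates follow the YYYY-MM-DD pattern.

```python
import pandas as pd

# Made-up rows; two of them contain damaged entries similar to the ones listed above.
toy = pd.DataFrame({
    "date": ["2016-07-14", "2016-07-1T(-87.3 31.9)", "2016-07-15", "2016-07-16"],
    "number_of_strikes": ["12", "30", "20)", "7"],
})

# errors='coerce' turns unparseable values into NaT / NaN instead of raising.
toy["date"] = pd.to_datetime(toy["date"], format="%Y-%m-%d", errors="coerce")
toy["number_of_strikes"] = pd.to_numeric(toy["number_of_strikes"], errors="coerce")

# Drop rows where either conversion failed, then restore an integer dtype.
clean = toy.dropna(subset=["date", "number_of_strikes"]).copy()
clean["number_of_strikes"] = clean["number_of_strikes"].astype(int)
print(clean)
```

Two rows survive in this toy example. On real data it is worth checking how many rows were coerced before dropping them.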
Now, combine both data sets.
df = pd.concat([df,df2])
The date entries are in string format. For date manipulation, we need to convert them into the datetime format.
df['date'] = pd.to_datetime(df['date'])
We create four new columns, one each for week, month, quarter, and year, using the 'dt.strftime()' function to format these values. The letter after a % sign selects the part of the date to print. For example, %Y gives the four-digit year, %V the ISO week number, and %m the month number. %q gives the quarter, but it is only available for pandas Period values, which is why the quarter column first converts the dates with to_period('Q'). Any text between the % directives is printed as it is. With '%Y-W%V', a sample output would be 2016-W12.
df['week'] = df['date'].dt.strftime('%Y-W%V')
df['month'] = df['date'].dt.strftime('%Y-%m')
df['quarter'] = df['date'].dt.to_period('Q').dt.strftime('%Y-Q%q')
df['year'] = df['date'].dt.strftime('%Y')
Here is the output of the above manipulation.
df.head(10)
  | date | number_of_strikes | center_point_geom | week | month | quarter | year |
0 | 2016-01-04 | 55 | POINT(-83.2 21.1) | 2016-W01 | 2016-01 | 2016-Q1 | 2016 |
1 | 2016-01-04 | 33 | POINT(-83.1 21.1) | 2016-W01 | 2016-01 | 2016-Q1 | 2016 |
2 | 2016-01-05 | 46 | POINT(-77.5 22.1) | 2016-W01 | 2016-01 | 2016-Q1 | 2016 |
3 | 2016-01-05 | 28 | POINT(-76.8 22.3) | 2016-W01 | 2016-01 | 2016-Q1 | 2016 |
4 | 2016-01-05 | 28 | POINT(-77 22.1) | 2016-W01 | 2016-01 | 2016-Q1 | 2016 |
5 | 2016-01-05 | 30 | POINT(-76.7 22.3) | 2016-W01 | 2016-01 | 2016-Q1 | 2016 |
6 | 2016-01-05 | 34 | POINT(-76.8 22.4) | 2016-W01 | 2016-01 | 2016-Q1 | 2016 |
7 | 2016-01-06 | 31 | POINT(-74.2 25.9) | 2016-W01 | 2016-01 | 2016-Q1 | 2016 |
8 | 2016-01-06 | 24 | POINT(-76 22.9) | 2016-W01 | 2016-01 | 2016-Q1 | 2016 |
9 | 2016-01-06 | 25 | POINT(-75.3 22.7) | 2016-W01 | 2016-01 | 2016-Q1 | 2016 |
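To see what the format codes produce, here is a small sketch on three hand-picked dates (they are not rows from the data set). It also shows one caveat: %V is the ISO 8601 week number, so close to a year boundary the ISO week can belong to a different year than %Y reports. The %G directive gives the ISO year that matches %V. Whether this matters depends on the analysis, so treat the iso_week column as an optional variant.

```python
import pandas as pd

# Three hand-picked dates: a year start, a mid-year date, and a year end.
dates = pd.Series(pd.to_datetime(["2016-01-01", "2016-03-21", "2018-12-31"]))

labels = pd.DataFrame({
    "date": dates,
    # Calendar year combined with ISO week, as in the article.
    "calendar_week": dates.dt.strftime("%Y-W%V"),
    # ISO year combined with ISO week, so both parts follow the same calendar.
    "iso_week": dates.dt.strftime("%G-W%V"),
    # %q is only understood after converting to quarterly periods.
    "quarter": dates.dt.to_period("Q").dt.strftime("%Y-Q%q"),
})
print(labels)
```

For 2016-03-21 both week columns read 2016-W12. For 2016-01-01 the calendar_week column reads 2016-W53 while iso_week reads 2015-W53, and for 2018-12-31 they read 2018-W01 and 2019-W01.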
Now, let's plot a bar chart of the number of strikes per week in the year 2018. For this, we filter the data frame for the year 2018 and then group by week and sum the strikes.
df_by_week_2018 = df[df['year'] == '2018'][['week','number_of_strikes']].groupby('week').sum().reset_index()
df_by_week_2018.head()
  | week | number_of_strikes |
0 | 2018-W01 | 34843 |
1 | 2018-W02 | 353425 |
2 | 2018-W03 | 37132 |
3 | 2018-W04 | 412772 |
4 | 2018-W05 | 34972 |
Now plot a bar graph using matplotlib.pyplot.
plt.bar(x = df_by_week_2018['week'], height = df_by_week_2018['number_of_strikes'])
plt.plot()
plt.xlabel('Week Number')
plt.ylabel('Number of lightning strikes')
plt.title('Number of lightning strikes per week')
The labels on the x-axis overlap so much that we cannot read them. To fix this, we enlarge the figure and use the xticks() function of pyplot to rotate the x-axis labels and set their font size.
plt.figure(figsize=(20,5))
plt.bar(x = df_by_week_2018['week'], height = df_by_week_2018['number_of_strikes'])
plt.plot()
plt.xlabel('Week Number')
plt.ylabel('Number of lightning strikes')
plt.title('Number of lightning strikes per week')
plt.xticks(rotation = 45, fontsize = 8)
plt.show()
Next, plot lightning strikes by quarter for the full date range of available data. For visualization, it is easiest to work with numbers in millions, such as 25.2 million. As an example, the following code divides the number_of_strikes column by one million.
df_by_quarter = df['number_of_strikes'].div(1000000)
df_by_quarter.head()
This is the step that required the data cleaning at the start of the article.
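The sketch below reproduces that kind of failure on a made-up three-value Series (the values are hypothetical, not taken from the files). One stray string in the column is enough to make div() raise a TypeError. Coercing the column to numbers first, which is a swapped-in alternative to the filtering used earlier, lets the division go through.

```python
import pandas as pd

# Made-up column with one damaged string entry among the integers.
dirty = pd.Series([55, "20)", 46], dtype=object)

failed = False
try:
    dirty.div(1000000)
except TypeError as exc:
    # A string cannot be divided by a number, so pandas raises here.
    failed = True
    print("div() failed:", exc)

# After coercing to numeric and dropping the bad entry, div() works.
cleaned = pd.to_numeric(dirty, errors="coerce").dropna()
in_millions = cleaned.div(1000000)
print(in_millions)
```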
Now, let's make another data frame for the quarterly summary. Here we will also place the data labels on the bars in the visualization. So, we need to create a column for the data labels in the millions format too. I am achieving this by:
Dividing strikes column by a million
Changing the type to float
Rounding the result to one decimal place
Converting the type to string
Concatenating with "M" to represent million
# Group 2016-2018 data by quarter and sum
df_by_quarter = df[['quarter','number_of_strikes']].groupby('quarter').sum().reset_index()
# Format as text, in millions
df_by_quarter['number_of_strikes_formatted'] = df_by_quarter['number_of_strikes'].div(1000000).astype(float).round(1).astype(str) + 'M'
df_by_quarter.head()
  | quarter | number_of_strikes | number_of_strikes_formatted |
0 | 2016-Q1 | 2683798 | 2.7M |
1 | 2016-Q2 | 15078446 | 15.1M |
2 | 2016-Q3 | 21738874 | 21.7M |
3 | 2016-Q4 | 1969754 | 2.0M |
4 | 2017-Q1 | 2444279 | 2.4M |
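The same labels can also be built with Python string formatting instead of the chain of type conversions. This is a sketch of an alternative, not the method used above. The helper name to_millions_label is made up for illustration, and the five values are copied from the output shown above.

```python
import pandas as pd

# The five quarterly totals printed in the table above.
quarterly = pd.DataFrame({
    "quarter": ["2016-Q1", "2016-Q2", "2016-Q3", "2016-Q4", "2017-Q1"],
    "number_of_strikes": [2683798, 15078446, 21738874, 1969754, 2444279],
})

def to_millions_label(value):
    """Format a raw count as millions with one decimal place, e.g. '2.7M'."""
    return f"{value / 1_000_000:.1f}M"

quarterly["label"] = quarterly["number_of_strikes"].map(to_millions_label)
print(quarterly)
```

The format spec .1f always prints exactly one decimal, so 1969754 becomes 2.0M, matching the formatted column above.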
# data label function
def addlabels(x, y, labels):
    # Place each label centered above the bar at position i
    for i in range(len(x)):
        plt.text(i, y[i], labels[i], ha = 'center', va = 'bottom')
This function iterates over data and plots text labels above each bar of the bar graph.
plt.figure(figsize=(15,5))
plt.bar(x = df_by_quarter['quarter'], height= df_by_quarter['number_of_strikes'])
addlabels(df_by_quarter['quarter'], df_by_quarter['number_of_strikes'], df_by_quarter['number_of_strikes_formatted'])
plt.plot()
plt.xlabel('Quarter')
plt.ylabel('Number of lightning strikes')
plt.title('Number of lightning strikes per quarter (2016-2018)')
plt.show()
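Recent matplotlib versions also ship a built-in way to label bars. The sketch below assumes matplotlib 3.4 or later, where Axes.bar_label() is available, and uses three of the quarterly values shown earlier. It is an alternative to the hand-written addlabels() helper, not a replacement the article depends on.

```python
import matplotlib.pyplot as plt

# Three of the quarterly totals from the table above, with their labels.
quarters = ["2016-Q1", "2016-Q2", "2016-Q3"]
strikes = [2683798, 15078446, 21738874]
labels = ["2.7M", "15.1M", "21.7M"]

fig, ax = plt.subplots(figsize=(8, 4))
bars = ax.bar(quarters, strikes)

# bar_label() places one text label per bar; padding is in points above the bar.
texts = ax.bar_label(bars, labels=labels, padding=2)

ax.set_xlabel("Quarter")
ax.set_ylabel("Number of lightning strikes")
ax.set_title("Number of lightning strikes per quarter")
```

Calling plt.show() afterwards displays the figure. Because bar_label() works from the bar container, it does not rely on the data frame having a 0, 1, 2, ... index the way indexing y[i] does.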
We can create a grouped bar chart to better compare year-over-year changes each quarter. We'll do this by creating two new columns that break out the quarter and year from the quarter column. To do this, we take the last two characters of the quarter column to get quarter_number, and the first four characters to get year.
df_by_quarter['quarter_number'] = df_by_quarter['quarter'].str[-2:]
df_by_quarter['year'] = df_by_quarter['quarter'].str[:4]
df_by_quarter.head()
  | quarter | number_of_strikes | number_of_strikes_formatted | quarter_number | year |
0 | 2016-Q1 | 2683798 | 2.7M | Q1 | 2016 |
1 | 2016-Q2 | 15078446 | 15.1M | Q2 | 2016 |
2 | 2016-Q3 | 21738874 | 21.7M | Q3 | 2016 |
3 | 2016-Q4 | 1969754 | 2.0M | Q4 | 2016 |
4 | 2017-Q1 | 2444279 | 2.4M | Q1 | 2017 |
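Before plotting, one way to check the year-over-year layout numerically is to pivot the table so that each year becomes a column. This is an optional sketch that uses only the five rows printed above, so most of the 2017 column is empty here.

```python
import pandas as pd

# The five rows printed in the table above.
quarterly = pd.DataFrame({
    "quarter": ["2016-Q1", "2016-Q2", "2016-Q3", "2016-Q4", "2017-Q1"],
    "number_of_strikes": [2683798, 15078446, 21738874, 1969754, 2444279],
})

# Same string slicing as in the article.
quarterly["quarter_number"] = quarterly["quarter"].str[-2:]
quarterly["year"] = quarterly["quarter"].str[:4]

# One row per quarter number, one column per year; missing pairs become NaN.
wide = quarterly.pivot(index="quarter_number", columns="year", values="number_of_strikes")
print(wide)
```

Each row of wide holds the values that end up side by side in one group of the grouped bar chart.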
plt.figure(figsize = (15, 5))
p = sns.barplot(
    data = df_by_quarter,
    x = 'quarter_number',
    y = 'number_of_strikes',
    hue = 'year')
# Label each bar with its height in millions
for b in p.patches:
    p.annotate(str(round(b.get_height()/1000000, 1))+'M',
               (b.get_x() + b.get_width() / 2., b.get_height() + 1.2e6),
               ha = 'center', va = 'bottom',
               xytext = (0, -12),
               textcoords = 'offset points')
plt.xlabel("Quarter")
plt.ylabel("Number of lightning strikes")
plt.title("Number of lightning strikes per quarter (2016-2018)")
plt.show()
In this grouped bar graph, the bars are colored by year, with a legend, because we assign 'year' to the hue parameter. sns.barplot() returns a matplotlib Axes object, stored here in p. Its 'patches' attribute gives access to the bars in the graph, and its 'annotate' method places a label on each bar. The 'xytext' parameter in 'annotate' slightly offsets the label from its data position for display purposes. The 'textcoords' parameter sets the units of that offset; here 'offset points' means points relative to the annotated position.
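The same two pieces, the patches list and annotate(), can be tried on a plain matplotlib bar chart without seaborn. The sketch below uses made-up bar heights and a small upward offset of 3 points. Both are illustrative choices, not values from the article.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 4))
# Made-up heights, used only to have something to annotate.
ax.bar(["Q1", "Q2", "Q3"], [2.0e6, 15.0e6, 21.0e6])

annotations = []
for bar in ax.patches:
    # Each patch is one rectangle; its height is the plotted value.
    label = f"{bar.get_height() / 1_000_000:.1f}M"
    top_center = (bar.get_x() + bar.get_width() / 2, bar.get_height())
    # xytext shifts the text 3 points up from top_center.
    annotations.append(
        ax.annotate(label, top_center, xytext=(0, 3),
                    textcoords="offset points", ha="center", va="bottom")
    )
```

Because the offset is given in points rather than data units, the gap between bar and label stays the same regardless of the scale of the y-axis.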
That brings us to the end of this article. These techniques take some practice to master. Thanks for reading. If you like the content, do share it.
Here is the GitHub link to access the code and data files.
https://github.com/zahadali/date-manipulation-in-python
Here is the link to the YouTube channel for learning videos.
https://www.youtube.com/@Analytics_10x
Happy learning and sharing.