Date Manipulation in Python for Data Analysis

In this article, we will look at advanced date manipulation with Python. From dates, we will extract weeks, months, quarters, years, and their multiple combinations necessary for our data analysis.

After this, we would make visualizations of our data summarized on these combinations. We would also see how to add data labels in the bar graph.

First, we import the necessary libraries.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Load the data.

df = pd.read_csv('lightning_strikes_2016.csv', on_bad_lines='skip')
df2 = pd.read_csv('lightning_strikes_2018.csv', on_bad_lines='skip')

The parameter on_bad_lines handles the rows that have more entries than the number of columns.

Next, I have done some data cleaning by filtering some rows out. Later in the article, when I was using div() function on the count of strikes, I got an error. It was due to dirty string entries there. So, I identified the inconsistent dates in the data and filtered those out.

arr = ['225.5)', '20)', '2016INT(-82.1 22.3)',
       '2016-07-146-07-14', '2012)', '2016-0.3 50.2)', '2016-07-15 18.5)',
       '20T(-75.5 41.9)', '203.5)', '2016-07-49.5)', '2016', '2014',
       '2016--07-14', '2016-08 45.2)', '2016-07-1T(-87.3 31.9)',
       '2016-07-1416-07-14', '2016-07-141', '2016-09 22.9)',
       '2016-07-0 50.7)', '2034.6)', '2016-0.9)', '2016-07-5',
       '2016-07-7-15', '2016-07-1NT(-92.9 33.4)', '2015',
       '2016-07-16-07-15', '2016-07-15 19.6)', '2016-',
       '2016-07NT(-79 35.7)', '2016-07-T(-84.8 19.9)', '2016-07- 25.2)',
       '2016--98.8 40.4)', '215', '2016016-07-15', '2016(-81.1 21.8)',
       '20-07-15', '216-07-15', '2016-07-016-07-15']
df = df[~df['date'].isin(arr)]
df = df[df['number_of_strikes'].apply(lambda x: isinstance(x, int))]

After filtering those rows, I am converting the data type of all the entries of the column 'number_of_strikes' to an integer type. For this, I am using lambda function to avoid 'for' loop. A lambda function takes the inputs from an iterable and applies the defined operation on all the entries of that iterable.

Now, combine both data sets.

df = pd.concat([df,df2])

The date entries are in the string format. For date manipulation, we need to convert those into the datatime format.

df['date'] = pd.to_datetime(df['date'])

We are creating four new columns each for week, month, quarter, and year. We are using 'dt.strftime()' function of the datetime object to extract these values. The parameter after % sign returns the numerical values of the entity. For example, %Y returns the numerical value of year. Similarly %V, %m, and %q return the numerical values of the week, month, and quarter. Any string part in between these two percent parameters is printed as it is. In '%Y-W%V' the sample output would be 2016-W12.

df['week'] = df['date'].dt.strftime('%Y-W%V')
df['month'] = df['date'].dt.strftime('%Y-%m')
df['quarter'] = df['date'].dt.to_period('Q').dt.strftime('%Y-Q%q')
df['year'] = df['date'].dt.strftime('%Y')

Here is the output of the above manipulation.

df.head(10)

	date	number_of_strikes	center_point_geom	week	month	quarter	year
0	2016-01-04	55	POINT(-83.2 21.1)	2016-W01	2016-01	2016-Q1	2016
1	2016-01-04	33	POINT(-83.1 21.1)	2016-W01	2016-01	2016-Q1	2016
2	2016-01-05	46	POINT(-77.5 22.1)	2016-W01	2016-01	2016-Q1	2016
3	2016-01-05	28	POINT(-76.8 22.3)	2016-W01	2016-01	2016-Q1	2016
4	2016-01-05	28	POINT(-77 22.1)	2016-W01	2016-01	2016-Q1	2016
5	2016-01-05	30	POINT(-76.7 22.3)	2016-W01	2016-01	2016-Q1	2016
6	2016-01-05	34	POINT(-76.8 22.4)	2016-W01	2016-01	2016-Q1	2016
7	2016-01-06	31	POINT(-74.2 25.9)	2016-W01	2016-01	2016-Q1	2016
8	2016-01-06	24	POINT(-76 22.9)	2016-W01	2016-01	2016-Q1	2016
9	2016-01-06	25	POINT(-75.3 22.7)	2016-W01	2016-01	2016-Q1	2016

Now. let's plot a bar chart for the number of strikes per week in the year 2018. For this, we would filter the data frame for the year 2018 and then apply the grouping function.

df_by_week_2018 = df[df['year'] == '2018'][['week','number_of_strikes']].groupby('week').sum().reset_index()
df_by_week_2018.head()

	week	number_of_strikes
0	2018-W01	34843
1	2018-W02	353425
2	2018-W03	37132
3	2018-W04	412772
4	2018-W05	34972

Now plot a bar graph using matplotlib.pyplot

plt.bar(x = df_by_week_2018['week'], height = df_by_week_2018['number_of_strikes'])
plt.plot()
plt.xlabel('Week Number')
plt.xlabel('Number of lightning strikes')
plt.title('Number of lightning strikes per week')

The labels on x-axis are quite mingled up. We cannot see these labels. To fix these, we would use the xticks() function of pyplot to rotate and size the fonts in labels on x-axis.

plt.figure(figsize=(20,5))
plt.bar(x = df_by_week_2018['week'], height = df_by_week_2018['number_of_strikes'])
plt.plot()
plt.xlabel('Week Number')
plt.xlabel('Number of lightning strikes')
plt.title('Number of lightning strikes per week')
plt.xticks(rotation = 45, fontsize = 8)
plt.show()

Next, plot lightning strikes by quarter for the full date range of available data. For visualization, it will be easiest to work with numbers in millions, such as 25.2 million. As an example, the following code will divide the number_of_strikes column by one million.

df_by_quarter = df['number_of_strikes'].div(1000000)
df_by_quarter.head()

That's the step for which I cleaned the data at the start of the article.

Now, let's make another data frame for the quarterly summary. Here we will also place the data labels on the bars in the visualization. So, we need to create a column for the data labels in the millions format too. I am achieving this by:

Dividing strikes column by a million
Changing the type to float
Rounding the result to one precision point
Converting the type to string
Concatenating with "M" to represent million

# Group 2016-2018 data by quarter and sum
df_by_quarter = df[['quarter','number_of_strikes']].groupby('quarter').sum().reset_index()

# Format as text, in millions
df_by_quarter['number_of_strikes_formatted'] = df_by_quarter['number_of_strikes'].div(1000000).astype(float).round(1).astype(str) + 'M'
df_by_quarter.head()

	quarter	number_of_strikes	number_of_strikes_formatted
0	2016-Q1	2683798	2.7M
1	2016-Q2	15078446	15.1M
2	2016-Q3	21738874	21.7M
3	2016-Q4	1969754	2.0M
4	2017-Q1	2444279	2.4M

# data label function
def addlabels (x, y, labels):
    for i in range(len(x)):
        plt.text(i, y[i], labels[i], ha = 'center', va = 'bottom')

This function iterates over data and plots text labels above each bar of the bar graph.

plt.figure(figsize=(15,5))
plt.bar(x = df_by_quarter['quarter'], height= df_by_quarter['number_of_strikes'])
addlabels(df_by_quarter['quarter'], df_by_quarter['number_of_strikes'], df_by_quarter['number_of_strikes_formatted'])
plt.plot()
plt.xlabel('Quarter')
plt.ylabel('Number of lightning strikes')
plt.title('Number of lightning strikes per quarter (2016-2018)')
plt.show()

We can create a grouped bar chart to better compare year-over-year changes each quarter. We'll do this by creating two new columns that break out the quarter and year from the quarter column. To do this, we use the quarter column and take the last two characters to get quarter_number, and take the first four characters to get year.df_by_quarter['quarter_number'] =

df_by_quarter['quarter_number'] = df_by_quarter['quarter'].str[-2:]
df_by_quarter['year'] = df_by_quarter['quarter'].str[:4]
df_by_quarter.head()

	quarter	number_of_strikes	number_of_strikes_formatted	quarter_number	year
0	2016-Q1	2683798	2.7M	Q1	2016
1	2016-Q2	15078446	15.1M	Q2	2016
2	2016-Q3	21738874	21.7M	Q3	2016
3	2016-Q4	1969754	2.0M	Q4	2016
4	2017-Q1	2444279	2.4M	Q1	2017

plt.figure(figsize = (15, 5))
p = sns.barplot(
    data = df_by_quarter,
    x = 'quarter_number',
    y = 'number_of_strikes',
    hue = 'year')
for b in p.patches:
    p.annotate(str(round(b.get_height()/1000000, 1))+'M', 
                   (b.get_x() + b.get_width() / 2., b.get_height() + 1.2e6), 
                   ha = 'center', va = 'bottom', 
                   xytext = (0, -12), 
                   textcoords = 'offset points')
plt.xlabel("Quarter")
plt.ylabel("Number of lightning strikes")
plt.title("Number of lightning strikes per quarter (2016-2018)")
plt.show()

In this advanced bar graph, we are giving color legend to the year by assigning it to hue parameter. The 'patches' function of pyplot is used to access the bars in the graph. The 'annotate' function of pyplot is used to place labels on these bars. The'xytext' parameter in 'annotate' is used to slightly off set the label from its valued position just for the sake of display. The 'textcoords' parameter set the type of off set for 'xytext'.

Here ends this article with some practice needed to grip these techniques. Thanks for reading. If you like the content, do share it.

I am pasting the GitHub link to access the code and data files.

https://github.com/zahadali/date-manipulation-in-python

The link to the YouTube channel is pasted for learning videos.

https://www.youtube.com/@Analytics_10x

Happy learning and sharing.