Date Manipulation in Python for Data Analysis

·

7 min read

In this article, we will look at advanced date manipulation with Python. From dates, we will extract weeks, months, quarters, years, and their multiple combinations necessary for our data analysis.

After this, we would make visualizations of our data summarized on these combinations. We would also see how to add data labels in the bar graph.

First, we import the necessary libraries.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Load the data.

df = pd.read_csv('lightning_strikes_2016.csv', on_bad_lines='skip')
df2 = pd.read_csv('lightning_strikes_2018.csv', on_bad_lines='skip')

The parameter on_bad_lines handles the rows that have more entries than the number of columns.

Next, I have done some data cleaning by filtering some rows out. Later in the article, when I was using div() function on the count of strikes, I got an error. It was due to dirty string entries there. So, I identified the inconsistent dates in the data and filtered those out.

arr = ['225.5)', '20)', '2016INT(-82.1 22.3)',
       '2016-07-146-07-14', '2012)', '2016-0.3 50.2)', '2016-07-15 18.5)',
       '20T(-75.5 41.9)', '203.5)', '2016-07-49.5)', '2016', '2014',
       '2016--07-14', '2016-08 45.2)', '2016-07-1T(-87.3 31.9)',
       '2016-07-1416-07-14', '2016-07-141', '2016-09 22.9)',
       '2016-07-0 50.7)', '2034.6)', '2016-0.9)', '2016-07-5',
       '2016-07-7-15', '2016-07-1NT(-92.9 33.4)', '2015',
       '2016-07-16-07-15', '2016-07-15 19.6)', '2016-',
       '2016-07NT(-79 35.7)', '2016-07-T(-84.8 19.9)', '2016-07- 25.2)',
       '2016--98.8 40.4)', '215', '2016016-07-15', '2016(-81.1 21.8)',
       '20-07-15', '216-07-15', '2016-07-016-07-15']
df = df[~df['date'].isin(arr)]
df = df[df['number_of_strikes'].apply(lambda x: isinstance(x, int))]

After filtering those rows, I am converting the data type of all the entries of the column 'number_of_strikes' to an integer type. For this, I am using lambda function to avoid 'for' loop. A lambda function takes the inputs from an iterable and applies the defined operation on all the entries of that iterable.

Now, combine both data sets.

df = pd.concat([df,df2])

The date entries are in the string format. For date manipulation, we need to convert those into the datatime format.

df['date'] = pd.to_datetime(df['date'])

We are creating four new columns each for week, month, quarter, and year. We are using 'dt.strftime()' function of the datetime object to extract these values. The parameter after % sign returns the numerical values of the entity. For example, %Y returns the numerical value of year. Similarly %V, %m, and %q return the numerical values of the week, month, and quarter. Any string part in between these two percent parameters is printed as it is. In '%Y-W%V' the sample output would be 2016-W12.

df['week'] = df['date'].dt.strftime('%Y-W%V')
df['month'] = df['date'].dt.strftime('%Y-%m')
df['quarter'] = df['date'].dt.to_period('Q').dt.strftime('%Y-Q%q')
df['year'] = df['date'].dt.strftime('%Y')

Here is the output of the above manipulation.

df.head(10)
datenumber_of_strikescenter_point_geomweekmonthquarteryear
02016-01-0455POINT(-83.2 21.1)2016-W012016-012016-Q12016
12016-01-0433POINT(-83.1 21.1)2016-W012016-012016-Q12016
22016-01-0546POINT(-77.5 22.1)2016-W012016-012016-Q12016
32016-01-0528POINT(-76.8 22.3)2016-W012016-012016-Q12016
42016-01-0528POINT(-77 22.1)2016-W012016-012016-Q12016
52016-01-0530POINT(-76.7 22.3)2016-W012016-012016-Q12016
62016-01-0534POINT(-76.8 22.4)2016-W012016-012016-Q12016
72016-01-0631POINT(-74.2 25.9)2016-W012016-012016-Q12016
82016-01-0624POINT(-76 22.9)2016-W012016-012016-Q12016
92016-01-0625POINT(-75.3 22.7)2016-W012016-012016-Q12016

Now. let's plot a bar chart for the number of strikes per week in the year 2018. For this, we would filter the data frame for the year 2018 and then apply the grouping function.

df_by_week_2018 = df[df['year'] == '2018'][['week','number_of_strikes']].groupby('week').sum().reset_index()
df_by_week_2018.head()
weeknumber_of_strikes
02018-W0134843
12018-W02353425
22018-W0337132
32018-W04412772
42018-W0534972

Now plot a bar graph using matplotlib.pyplot

plt.bar(x = df_by_week_2018['week'], height = df_by_week_2018['number_of_strikes'])
plt.plot()
plt.xlabel('Week Number')
plt.xlabel('Number of lightning strikes')
plt.title('Number of lightning strikes per week')

The labels on x-axis are quite mingled up. We cannot see these labels. To fix these, we would use the xticks() function of pyplot to rotate and size the fonts in labels on x-axis.

plt.figure(figsize=(20,5))
plt.bar(x = df_by_week_2018['week'], height = df_by_week_2018['number_of_strikes'])
plt.plot()
plt.xlabel('Week Number')
plt.xlabel('Number of lightning strikes')
plt.title('Number of lightning strikes per week')
plt.xticks(rotation = 45, fontsize = 8)
plt.show()

Next, plot lightning strikes by quarter for the full date range of available data. For visualization, it will be easiest to work with numbers in millions, such as 25.2 million. As an example, the following code will divide the number_of_strikes column by one million.

df_by_quarter = df['number_of_strikes'].div(1000000)
df_by_quarter.head()

That's the step for which I cleaned the data at the start of the article.

Now, let's make another data frame for the quarterly summary. Here we will also place the data labels on the bars in the visualization. So, we need to create a column for the data labels in the millions format too. I am achieving this by:

  • Dividing strikes column by a million

  • Changing the type to float

  • Rounding the result to one precision point

  • Converting the type to string

  • Concatenating with "M" to represent million

# Group 2016-2018 data by quarter and sum
df_by_quarter = df[['quarter','number_of_strikes']].groupby('quarter').sum().reset_index()

# Format as text, in millions
df_by_quarter['number_of_strikes_formatted'] = df_by_quarter['number_of_strikes'].div(1000000).astype(float).round(1).astype(str) + 'M'
df_by_quarter.head()
quarternumber_of_strikesnumber_of_strikes_formatted
02016-Q126837982.7M
12016-Q21507844615.1M
22016-Q32173887421.7M
32016-Q419697542.0M
42017-Q124442792.4M
# data label function
def addlabels (x, y, labels):
    for i in range(len(x)):
        plt.text(i, y[i], labels[i], ha = 'center', va = 'bottom')

This function iterates over data and plots text labels above each bar of the bar graph.

plt.figure(figsize=(15,5))
plt.bar(x = df_by_quarter['quarter'], height= df_by_quarter['number_of_strikes'])
addlabels(df_by_quarter['quarter'], df_by_quarter['number_of_strikes'], df_by_quarter['number_of_strikes_formatted'])
plt.plot()
plt.xlabel('Quarter')
plt.ylabel('Number of lightning strikes')
plt.title('Number of lightning strikes per quarter (2016-2018)')
plt.show()

We can create a grouped bar chart to better compare year-over-year changes each quarter. We'll do this by creating two new columns that break out the quarter and year from the quarter column. To do this, we use the quarter column and take the last two characters to get quarter_number, and take the first four characters to get year.df_by_quarter['quarter_number'] =

df_by_quarter['quarter_number'] = df_by_quarter['quarter'].str[-2:]
df_by_quarter['year'] = df_by_quarter['quarter'].str[:4]
df_by_quarter.head()
quarternumber_of_strikesnumber_of_strikes_formattedquarter_numberyear
02016-Q126837982.7MQ12016
12016-Q21507844615.1MQ22016
22016-Q32173887421.7MQ32016
32016-Q419697542.0MQ42016
42017-Q124442792.4MQ12017
plt.figure(figsize = (15, 5))
p = sns.barplot(
    data = df_by_quarter,
    x = 'quarter_number',
    y = 'number_of_strikes',
    hue = 'year')
for b in p.patches:
    p.annotate(str(round(b.get_height()/1000000, 1))+'M', 
                   (b.get_x() + b.get_width() / 2., b.get_height() + 1.2e6), 
                   ha = 'center', va = 'bottom', 
                   xytext = (0, -12), 
                   textcoords = 'offset points')
plt.xlabel("Quarter")
plt.ylabel("Number of lightning strikes")
plt.title("Number of lightning strikes per quarter (2016-2018)")
plt.show()

In this advanced bar graph, we are giving color legend to the year by assigning it to hue parameter. The 'patches' function of pyplot is used to access the bars in the graph. The 'annotate' function of pyplot is used to place labels on these bars. The'xytext' parameter in 'annotate' is used to slightly off set the label from its valued position just for the sake of display. The 'textcoords' parameter set the type of off set for 'xytext'.

Here ends this article with some practice needed to grip these techniques. Thanks for reading. If you like the content, do share it.

I am pasting the GitHub link to access the code and data files.

https://github.com/zahadali/date-manipulation-in-python

The link to the YouTube channel is pasted for learning videos.

https://www.youtube.com/@Analytics_10x

Happy learning and sharing.