The Power of Visualization - Bar Plot in Python

People reportedly retain about 65% of information three days after seeing it presented with an image, compared with about 10% of the information they only hear. A figure often attributed to an MIT study holds that 90% of the information transmitted to our brain is visual.

Visualizations are essential in data analysis, whether for EDA, advanced analysis, statistical study, or verifying model assumptions. They make it easier to draw insights than reading raw or even organized data. When a dataset is properly structured and visualized, we can make informed decisions more quickly than by scanning and comparing entries in a table.

In this article, I am going to build a simple bar plot from lightning strike data collected by the National Oceanic and Atmospheric Administration (NOAA).

I will use Python's pandas library and manipulate the dates to compare lightning counts month over month.
To start, I import all the required libraries.

import pandas as pd

import numpy as np

import datetime as dt

import matplotlib.pyplot as plt

Now, I load the dataset into a pandas DataFrame.

df = pd.read_csv('eda_structuring_with_python_dataset1.csv')

Inspect the first ten rows to get an idea of the data fields.

df.head(10)

Notice that each row records the strike count for one day at one location, with the location stored as a point geometry.

Get the size of the dataframe, i.e. the number of rows and columns.

df.shape

The shape is 3,401,012 rows by 3 columns. Calling df.info() reports the same counts and also lists the name and data type of each column, as well as the size of the dataframe in memory.
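To make the difference between shape and info() concrete, here is a minimal sketch on a tiny stand-in dataframe. The column names match the dataset above, but the values are invented for illustration and are not NOAA data.

```python
import io

import pandas as pd

# Hypothetical stand-in for the NOAA file: same column names, invented values.
sample = pd.DataFrame({
    "date": ["2018-08-01", "2018-08-01", "2018-09-02"],
    "number_of_strikes": [10, 5, 7],
    "center_point_geom": ["POINT(-81.5 22.6)", "POINT(-84.2 22.4)", "POINT(-80.1 25.0)"],
})

# shape is an attribute (no parentheses) holding a (rows, columns) tuple.
rows, cols = sample.shape

# info() prints its report by default; passing buf captures the text instead.
buffer = io.StringIO()
sample.info(buf=buffer)
summary = buffer.getvalue()
print(summary)
```

The printed summary lists every column with its non-null count and dtype, which is where a date column stored as text shows up.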

In this case, notice that the date column has the 'object' dtype rather than a datetime dtype. In pandas, an object column usually holds strings. Dates stored as strings cannot be sorted, filtered, or broken into parts such as month reliably, so converting them to datetime makes them much easier to work with.

Let's convert the column to datetime using the pandas function to_datetime().

df['date'] = pd.to_datetime(df['date'])
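The conversion can be made stricter if needed. The sketch below assumes the dates are written as YYYY-MM-DD, which the dataset description above does not confirm, and uses invented values including one deliberately bad entry.

```python
import pandas as pd

# Invented values; the last entry is deliberately not a valid date.
sample = pd.DataFrame({"date": ["2018-08-01", "2018-09-02", "not a date"]})

# format states the expected layout explicitly (an assumption about the file).
# errors="coerce" turns unparseable entries into NaT instead of raising an error.
parsed = pd.to_datetime(sample["date"], format="%Y-%m-%d", errors="coerce")

is_datetime = pd.api.types.is_datetime64_any_dtype(parsed)
n_bad = int(parsed.isna().sum())  # how many entries failed to parse
print(parsed.dtype, n_bad)
```

Counting the NaT values after conversion is a quick way to check whether any rows were silently dropped from later date-based summaries.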

Now, I can use the date to find the top ten days with the most strikes.

df[['date','number_of_strikes']].groupby(['date']).sum().sort_values('number_of_strikes', ascending = False).head(10)
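The one-line chain above does three things: group rows by date, sum the strikes within each date, and sort the totals in descending order. Here is the same idea broken into steps on a few invented rows, so the intermediate result is visible.

```python
import pandas as pd

# Invented rows: two locations reported on 2018-08-01, one on each other day.
sample = pd.DataFrame({
    "date": pd.to_datetime(["2018-08-01", "2018-08-01", "2018-09-02", "2018-09-03"]),
    "number_of_strikes": [10, 5, 7, 20],
})

# Step 1 and 2: collapse all rows that share a date into one total per date.
daily_totals = sample.groupby("date")["number_of_strikes"].sum()

# Step 3: sort from highest to lowest and keep the top two days.
top_days = daily_totals.sort_values(ascending=False).head(2)
print(top_days)
```

Selecting the single column after groupby returns a Series indexed by date, which is a slightly lighter variant of the dataframe produced by the chain above.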

Notice that I haven't used the 'center_point_geom' column here because it is irrelevant to a summary of strike counts. I am working with only the columns I need.

Now, I want to find the month with the greatest number of strikes, because plotting every day of the year on a bar plot is less useful. For a month-wise summary, I need to extract the month from the date column. I create a new month column using the dt.month attribute, which is available on a datetime column through the .dt accessor.

df['month'] = df['date'].dt.month
df.head()

However, a month number is not very readable on a chart, so I also need the month name. I use the dt.month_name() method to get the name and keep only its first three letters.

df['month_txt'] = df['date'].dt.month_name().str.slice(stop = 3)
df.head()
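As a quick check of what these two columns contain, here is a small sketch on three invented dates. It relies on month_name() returning English names by default.

```python
import pandas as pd

# Three invented dates, already converted to datetime.
dates = pd.Series(pd.to_datetime(["2018-01-15", "2018-08-01", "2018-12-31"]))

month = dates.dt.month                               # integers 1 to 12
month_txt = dates.dt.month_name().str.slice(stop=3)  # "January" becomes "Jan"

print(month.tolist())
print(month_txt.tolist())
```

The .dt accessor works on datetime columns, and the .str accessor then treats the resulting month names as ordinary strings.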

With this, I can now visualize the month-over-month strike trend and spot the month with the most strikes.

I create another dataframe containing the month-wise summary table.

As a DIY exercise, look up the reset_index() method and learn what it does.

df_by_month = df[['month','month_txt','number_of_strikes']].groupby(['month','month_txt']).sum().sort_values('month', ascending = True).reset_index()
df_by_month
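For readers who want a hint on the exercise, here is a minimal sketch on invented values. It compares the grouped result before and after reset_index().

```python
import pandas as pd

# Invented rows with the same column names as the summary above.
sample = pd.DataFrame({
    "month": [8, 8, 9],
    "month_txt": ["Aug", "Aug", "Sep"],
    "number_of_strikes": [10, 5, 7],
})

# After groupby, the grouping keys become the index, not regular columns.
grouped = sample.groupby(["month", "month_txt"]).sum()

# reset_index() moves the keys back into columns and restores a 0..n-1 index.
flat = grouped.reset_index()

print(list(grouped.columns))
print(list(flat.columns))
```

Having 'month_txt' as a regular column is what lets it be passed directly as the x values of a plot.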

I sorted by the 'month' column because sorting by 'month_txt' would arrange the months alphabetically (Apr, Aug, Dec, ...) instead of in calendar order.
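The difference is easy to see on a handful of months. This is only an illustration with four hand-picked months, not the full summary table.

```python
import pandas as pd

# Four hand-picked months, listed in calendar order.
by_month = pd.DataFrame({
    "month": [1, 4, 8, 12],
    "month_txt": ["Jan", "Apr", "Aug", "Dec"],
})

# Sorting on the text label gives alphabetical order.
alphabetical = by_month.sort_values("month_txt")["month_txt"].tolist()

# Sorting on the month number keeps calendar order.
calendar = by_month.sort_values("month")["month_txt"].tolist()

print(alphabetical)
print(calendar)
```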

Now, let's make a bar chart. Pyplot's plt.bar() function takes x and height as its first two arguments, representing the data used for the x- and y-axes, respectively. The x-axis will represent months, and the y-axis will represent strike count.

plt.bar(x = df_by_month['month_txt'], height = df_by_month['number_of_strikes'], label = 'Number of Strikes')

plt.xlabel('Months (2018)')
plt.ylabel('Number of Lightning Strikes')
plt.title('Number of Lightning Strikes in 2018 by Month')
plt.legend()
plt.show()
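The same chart can also be built with Matplotlib's object-oriented interface, which some readers find easier to extend. This is a sketch of an alternative, not the approach used above. The three monthly totals are invented, and the Agg backend is chosen only so the script runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook

import matplotlib.pyplot as plt
import pandas as pd

# Invented monthly totals standing in for df_by_month.
by_month = pd.DataFrame({
    "month_txt": ["Jan", "Feb", "Mar"],
    "number_of_strikes": [100, 250, 175],
})

# fig is the whole figure; ax is the single set of axes the bars are drawn on.
fig, ax = plt.subplots(figsize=(8, 4))
bars = ax.bar(by_month["month_txt"], by_month["number_of_strikes"],
              label="Number of Strikes")

ax.set_xlabel("Months (2018)")
ax.set_ylabel("Number of Lightning Strikes")
ax.set_title("Number of Lightning Strikes in 2018 by Month")
ax.legend()
fig.tight_layout()
```

Keeping a reference to ax makes it straightforward to adjust one chart without affecting others when several plots are created in the same script.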

From the above bar plot, we can easily identify August as the month with the highest number of lightning strikes.

Hurrah! This visualization made your life much easier!

Thanks for reading. If you like the content, do share it.

Thank you! Happy learning and sharing.