Introduction to Data Visualization with the Matplotlib library
Data visualization is essential nowadays. After all, it’s easier to communicate information and extract insights from a graph than from rows of data.
Today I want to give you a brief introduction to the Python plotting library, Matplotlib. In this article, we’ll go through creating some of the most common visualizations, from line and scatter plots to histograms and pie charts.
Install Matplotlib
For starters, we need to get Matplotlib installed. As usual in Python, open the terminal and enter
pip install matplotlib
More often than not, you will probably be using pandas DataFrames to process data coming from external sources like a CSV file. However, for the sake of keeping these examples simple and focusing only on the Matplotlib code, we will use hard-coded data for the visualizations.
How to create a visualization in Matplotlib
As we’ll see throughout the examples, to create a plot or chart in the Matplotlib library, we always go through four main steps:
- Import the pyplot module of the Matplotlib library
- Load the data to be plotted (in our case it will be hard-coded as Python lists)
- Call the appropriate plotting function (for instance, line plots are created with the plot() function and scatter plots with scatter())
- Call the show() method to show the created plot
With only these four steps, we can create any basic plot or chart available in Matplotlib, varying only the data used and the plotting function called.
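As a rough sketch, the four steps look like this in their barest form. The values in x_values and y_values are made up purely for illustration; any two lists of the same length would do.

```python
# Step 1: import the pyplot module
import matplotlib.pyplot as plt

# Step 2: load the data (hard-coded, made-up values for illustration)
x_values = [1, 2, 3, 4, 5]
y_values = [2, 3, 5, 7, 11]

# Step 3: call the plotting function (plot() draws a line plot)
lines = plt.plot(x_values, y_values)

# Step 4: show the plot
plt.show()
```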
The remaining functionality shown in the following examples consists of modifications such as adding labels to the axes, removing the ticks from an axis, etc.
Thus, what I will show you next are short demonstrations of how to create six different visualizations in Matplotlib, along with the respective results. I hope you find this introduction useful for your data visualization needs :)
Line Plot
We’ll start with the line plot. This is one of the most common visualizations, and it’s good for showing how a variable changes over time or for spotting trends in data.
In this example, we’ll plot the doubles and the squares of all the integers between 1 and 10, inclusive, to see how quickly these two groups of numbers grow.
To reiterate, we only need four things to create the line plot:
- The pyplot module of the Matplotlib library (we create an alias for it, plt, to shorten the name of the module)
- The data to be plotted (the nums, the doubles and the squares lists)
- To call the plot() function to create one line plot of the numbers against their doubles and another of the numbers against their squares
- To call the show() method to show the resulting plot
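Put together, a line plot along these lines could be sketched as follows. The nums, doubles and squares lists follow the description above, while the axis label texts, the legend label texts and the line_plot.png file name are my own illustrative choices.

```python
import matplotlib.pyplot as plt

# Data: the integers from 1 to 10, their doubles and their squares
nums = list(range(1, 11))
doubles = [n * 2 for n in nums]
squares = [n ** 2 for n in nums]

# Two calls to plot() draw two lines on the same graph;
# the label text is what legend() displays for each line
doubles_line, = plt.plot(nums, doubles, label="Doubles")
squares_line, = plt.plot(nums, squares, label="Squares")

# Formatting (the label texts are illustrative choices)
plt.xlabel("Number")
plt.ylabel("Result")
plt.legend()
plt.tight_layout()

# Save before calling show(); the file name is an assumption
plt.savefig("line_plot.png")
plt.show()
```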
With that said, there are some things to explain because this is the first example.
First up, we need to call show() because every change we make affects the same plot. For example, even though we create two line plots with the plot() function, the second doesn’t overwrite the first. Instead, they are both drawn in the same graph. Even the formatting changes, like adding labels to the axes with xlabel() and ylabel() or trimming the padding around the created graph with tight_layout(), are changes to that same graph. Until a graph is “finished” by calling the show() function, all changes will be applied to it.
Whatever text is passed to the label argument of plot() is used to label the respective line in the legend. The legend itself is created by calling the legend() function. By default, Matplotlib picks the position that overlaps the plotted data the least, which for this graph is the top left corner. The position can be changed by specifying it when calling the function.
Lastly, the savefig() function. It saves the current plot as an image file, such as a PNG. You only need to pass it a string with the name of the resulting image. Make sure to save before calling show(), though: if you call savefig() afterwards, the plot will already have been wiped.
And now that we’ve gone through the code, here is the resulting line plot.
If you run this Python code in your local code editor, the show() method will open a new window to display the created graph. This window includes some helpful options for adjusting the axis margins, saving the graph as an image, zooming and more.
With that said, all of these options available in the graphical interface can also be achieved in code, thanks to the many formatting and plotting options available in this library. If you want to learn more about Matplotlib, I encourage you to check out the examples page on their website. Matplotlib is incredibly versatile for data visualization and this article is only meant to get you started.
Scatter Plot
Next up we have the scatter plot. This is a good visualization for identifying relationships between two variables or simply to see how your data is distributed.
In the scatter plot example, we are plotting the favorite numbers of a group of twenty people. So, we call the scatter() function and make use of the marker argument to draw the points on the plot as “x”s.
For the formatting, we remove the X axis ticks by setting the ticks to an empty list (by calling xticks([])). The remaining changes are similar to the ones used in the line plot.
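A scatter plot along those lines might look like the sketch below. The twenty favorite numbers are invented for illustration, and the people list (simply 1 to 20) and the Y axis label are my own assumptions.

```python
import matplotlib.pyplot as plt

# Made-up data: the favorite numbers of twenty people
people = list(range(1, 21))
favorite_numbers = [7, 3, 13, 8, 7, 21, 4, 9, 7, 42,
                    11, 5, 3, 17, 8, 10, 2, 7, 27, 6]

# marker="x" draws each point as an "x" instead of a dot
points = plt.scatter(people, favorite_numbers, marker="x")

# Passing an empty list to xticks() removes the X axis ticks
plt.xticks([])
plt.ylabel("Favorite number")
plt.tight_layout()
plt.show()
```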
And so this is the resulting graph for the scatter plot.
Pie Chart
Pie charts are another fairly common visualization. These are great for showing the relative size of each component of a complete “pie”, such as the departments of a company.
Coincidentally, we will create a pie chart to show the relative sizes of the departments at a fake company.
To create a pie chart, we use the pie() function. To that end, we have three lists with information for each slice/wedge of the pie:
- The sizes (number of employees per department)
- The slices’ labels
- The colors to fill the slices
Additionally, when calling the pie() function, we specify that we want the relative value of each wedge to be displayed by including the autopct argument.
We pass autopct the string %1.2f%%:
- The first percent sign signifies that we are adding a formatting value
- 1.2f means that we want to display floating-point numbers with two decimal places
- The two percent signs at the end add a literal percent sign to the displayed values
Of course, you can change this as you see fit, such as dropping the percent sign, changing the number of decimal places or not displaying these values at all.
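Here is a minimal sketch of such a pie chart. The department names, head counts and colors are all invented for a fictional company; only the pie() call and the autopct string follow the description above.

```python
import matplotlib.pyplot as plt

# Made-up data for a fictional company: one entry per slice
sizes = [25, 12, 8, 5]  # employees per department
labels = ["Engineering", "Sales", "Marketing", "HR"]
colors = ["tab:blue", "tab:orange", "tab:green", "tab:red"]

# autopct="%1.2f%%" prints each slice's share with two decimals and a % sign
wedges, texts, autotexts = plt.pie(
    sizes, labels=labels, colors=colors, autopct="%1.2f%%"
)

plt.tight_layout()
plt.show()
```

With these numbers, Engineering accounts for 25 of the 50 employees, so its wedge is annotated with 50.00%.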
And so this is our resulting pie chart:
Column/Bar Chart
Next up we have the column and bar chart. Column and bar charts are essentially the same type of visualization, except that the former has a vertical orientation while the latter has a horizontal orientation. Both are great for comparing the results of categories of data.
Unlike histograms, both column and bar charts represent discrete data. That is, if a person has one dog, then they have one complete dog; they can’t have a dog and a half or a dog and three quarters. On the other hand, the columns/bins/class intervals in a histogram represent a range of continuous (possibly floating-point) values, like heights or weights.
In our examples, we’ll display the number of times each option was chosen when asking a group of forty individuals how long they’ve been working with Python (using fake data of course). Since the difference between creating a column chart and a bar chart is calling bar() or barh(), respectively, I’ve included both examples.
First the example for the column chart:
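A sketch of what this could look like, assuming four answer options. The option names and the counts are invented (they add up to forty), and the axis labels are illustrative choices.

```python
import matplotlib.pyplot as plt

# Made-up survey results from forty people
experience = ["< 1 year", "1-2 years", "3-5 years", "> 5 years"]
answers = [12, 15, 9, 4]  # adds up to 40

# bar() draws one vertical column per category
columns = plt.bar(experience, answers)

plt.xlabel("Time working with Python")
plt.ylabel("Number of people")
plt.tight_layout()
plt.show()
```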
Pretty standard code compared to what we’ve seen so far, resulting in this column chart.
And as discussed before, we can plot the column chart as a bar chart simply by using barh() instead of bar() (don’t forget the axes are swapped though!).
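The horizontal version could be sketched like this, reusing the same invented survey data. Note how the two axis labels trade places compared with a vertical column chart.

```python
import matplotlib.pyplot as plt

# Made-up survey results from forty people
experience = ["< 1 year", "1-2 years", "3-5 years", "> 5 years"]
answers = [12, 15, 9, 4]  # adds up to 40

# barh() draws one horizontal bar per category
bars = plt.barh(experience, answers)

# The axes are swapped: the counts now run along the X axis
plt.xlabel("Number of people")
plt.ylabel("Time working with Python")
plt.tight_layout()
plt.show()
```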
Histogram
Histograms are similar to the previously explored column/bar charts, with the difference that this visualization represents continuous values separated into class intervals or bins. For example, a histogram can show the distribution of the ages of a group of people, separated into ten-year intervals, or their heights separated into intervals of fifteen centimeters.
In this example, we have the ages of a group of one hundred people, which are then separated into class intervals of ten years. In other words, the first column shows how many people are between zero and ten years old (ten exclusive), the second shows how many are between ten and twenty years old (twenty exclusive), and so on.
Again, the code shown in this example is similar to what we’ve seen before. The only difference here is that we used the hist() function to create a histogram and passed it the color and edgecolor arguments to change the color and outline of the bars.
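As a sketch, a histogram like this could be built as shown below. To keep it short, I generate one hundred random ages with a fixed seed as a stand-in for a hard-coded list; the bin edges, the colors and the axis labels are also assumptions on my part.

```python
import random

import matplotlib.pyplot as plt

# Stand-in data: 100 random ages (a hard-coded list would work the same way)
random.seed(42)
ages = [random.randint(1, 89) for _ in range(100)]

# Bin edges every ten years: 0-10, 10-20, ..., 80-90
bins = list(range(0, 91, 10))

# color fills the bars, edgecolor draws their outline
counts, edges, patches = plt.hist(
    ages, bins=bins, color="skyblue", edgecolor="black"
)

plt.xlabel("Age")
plt.ylabel("Number of people")
plt.tight_layout()
plt.show()
```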
The following is our resulting histogram.
Fill Between (Line Plot)
Our last visualization is not an entirely new visualization per se; rather, it’s a modification to the line plots we’ve seen before. Sometimes, when plotting multiple lines, it would be good to have more visual information on the graph to see how the plots compare.
For instance, if we are plotting the sales of our company and the sales of the competition in the same graph, it could help the interpretation if the areas between the plots were filled, especially if they were filled with one color when our company sold more and another when the competition sold more.
And that’s what we’ll do in the following example. We’ll draw two line plots (exactly as shown before), one for the sales of our company and another for the sales of the competition, over a period of ten days.
Then, we will call the fill_between() function to fill two types of areas. In the first instance, we fill the areas where the sales of our company are higher, that is, where the values of company_sales are greater. In the second instance, we fill the areas where the sales of the competition are greater than or equal to those of our company, that is, where the values of competition_sales are greater than or equal to those of company_sales.
One particularity about fill_between(), though. While all the other plotting functions have worked with normal Python lists, here we store the data as NumPy arrays, because the conditions we pass to fill_between() compare the two sets of sales element by element, which plain lists don’t support. Thankfully, NumPy is installed along with Matplotlib, so we only need to add the NumPy import and save our usual lists of data as arrays. plot(), and all the other plotting functions we’ve seen, can work with both lists and arrays.
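To see why arrays help here, consider the comparisons used to decide which areas get filled. The short sketch below, with made-up numbers, contrasts comparing two lists with comparing two NumPy arrays.

```python
import numpy as np

# Made-up numbers, purely for illustration
list_a = [1, 2, 3]
list_b = [0, 5, 1]

# Lists are compared as a whole (lexicographically), giving a single bool
list_result = list_a > list_b

# NumPy arrays are compared element by element, giving one bool per position
array_result = np.array(list_a) > np.array(list_b)

print(list_result)   # True
print(array_result)  # [ True False  True]
```

Only the array version yields one True/False value per data point, which is what a per-point fill condition needs.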
When plotting the competition sales, we change the line style to be dashed (linestyle="--"). In the calls to fill_between(), we pass it the data for the X axis followed by the two sets of Y axis values that bound the area to fill with color. The conditions passed to the where argument are straightforward: only fill the areas where the specified condition is met. It’s also useful to pass the argument interpolate with a value of True, as it helps Matplotlib fill the areas right up to the points where the two lines intersect. Lastly, we specify the color of the filled area, the transparency (alpha) and the label of the filled area.
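Putting these pieces together, the example could be sketched as follows. The sales figures are invented, and the days array, the alpha value and all the label texts are my own assumptions; the green and red fills match the colors used for the two types of areas.

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up sales figures over ten days, stored as NumPy arrays
days = np.arange(1, 11)
company_sales = np.array([20, 24, 19, 30, 28, 35, 31, 27, 40, 38])
competition_sales = np.array([22, 21, 25, 26, 30, 29, 33, 34, 32, 36])

plt.plot(days, company_sales, label="Our company")
plt.plot(days, competition_sales, linestyle="--", label="Competition")

# Green where our company sold more than the competition
ahead_area = plt.fill_between(
    days, company_sales, competition_sales,
    where=(company_sales > competition_sales),
    interpolate=True, color="green", alpha=0.25, label="Ahead",
)

# Red where the competition sold as much as or more than our company
behind_area = plt.fill_between(
    days, company_sales, competition_sales,
    where=(competition_sales >= company_sales),
    interpolate=True, color="red", alpha=0.25, label="Behind",
)

plt.xlabel("Day")
plt.ylabel("Sales")
plt.legend()
plt.tight_layout()
plt.show()
```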
And so, this is our resulting line plot. The competition sales are drawn in a dashed style to make them look more like the results to beat. To make it easier to see where our company outperformed the competition, those areas are filled in green, while the remaining areas are filled in red.
Conclusion
As you’ve seen, visualizing data in Python is made easy with Matplotlib. Creating the plots themselves is easy enough and there’s plenty of room to format them; what was shown here is just the tip of the iceberg. As mentioned in the line plot section, if you’d like to learn more about this fantastic library, I encourage you to check out their website, which includes the official documentation as well as a page with plenty more examples. You can find the examples page here.
As promised at the beginning of the article, you can find all the code samples, saved PNGs and a complete Jupyter Notebook in this GitHub repository. Alternatively, you can run that Jupyter Notebook on the web in this Google Colab.
Hope this article gave you some ideas for your future data visualization options :)