Sometimes you have multiple columns of measures for a single purpose, yet you only want to keep the one that performs according to your needs.
In this demo we’ll analyse a synthetic clustering model output dataset. The trick is that we have columns with the distance to the centre of each cluster, but not a column with the cluster assignment itself. As a result, it is hard to analyse the model predictions any further.
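To make the idea concrete, here is a minimal sketch of one way to recover the assignment with pandas: pick, for each row, the distance column with the smallest value. The column names and the numbers are made up for illustration and are not taken from the demo dataset.

```python
import pandas as pd


def assign_cluster(df, distance_cols):
    """Return a copy of df with the closest cluster and its distance per row."""
    out = df.copy()
    # idxmin(axis=1) gives, for each row, the name of the column holding the minimum.
    out["cluster"] = out[distance_cols].idxmin(axis=1)
    out["min_distance"] = out[distance_cols].min(axis=1)
    return out


# Hypothetical model output: one distance column per cluster centre.
distances = pd.DataFrame({
    "dist_cluster_0": [0.2, 1.5, 0.9],
    "dist_cluster_1": [1.1, 0.3, 0.8],
    "dist_cluster_2": [2.0, 0.7, 0.1],
})

assigned = assign_cluster(
    distances, ["dist_cluster_0", "dist_cluster_1", "dist_cluster_2"]
)
print(assigned[["cluster", "min_distance"]])
```

Once the assignment is a column of its own, the usual group-by and counting tools apply directly.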
While we’ve made great strides in analysing structured and semi-structured data, unstructured data is still running behind. Of course, this third type presents challenges not found in the other two, but we are steadily finding more options and tools for that job.
One of those tools is, as expected, Artificial Intelligence. Microsoft Azure has some interesting options for this job, namely Cognitive Services. As per their description,
“Cognitive Services brings AI within reach of every developer — without requiring machine-learning expertise”.
Today I bring you a guide on how this description is realized in practice by showing you how to…
I’ve been giving some serious thought as to how I learned Power BI. I do Power BI development for a living, but my learning experience was so all over the place that I’ve been trying to come up with a focused path for other people.
This is my attempt at designing that path. There’s not really a fixed order to these resources; you can go back and forth based on your level of comfort and curiosity.
Note for these tutorials and/or learning resources: don’t be thrown off by having a different UI in your Power BI version or having more…
My career choice is Data Science. However, 3D modeling has always had a special place in my heart, probably because video games play a big role in my free time (pun intended).
I’ve been slowly learning Blender and 3D Modeling since last summer, so today I decided to make a short compilation of the best tutorials I’ve completed: those that were jam-packed with helpful beginner tips and that, because of their timing or their sheer quality, stuck with me the most.
Don’t get me wrong, I’m still a Blender noob, but at this point I’ve gone through several tutorials…
In the previous articles we’ve created four different Jupyter Notebooks that achieve different data transformations and visualizations of the 2020 Stack Overflow Developer Survey data. Today we are moving all of that to the cloud, more specifically to an Azure Databricks workspace.
As prerequisites, you need an Azure account as well as a valid subscription, and of course an Azure Databricks (ADB) workspace (check this documentation page on how to create one). After you’ve created the workspace, I will show you how to set up a cluster to run your computations on, and upload the notebooks and the data.
Today we reach the fourth part of this series, the last part about writing code. The fifth and last part will be about moving our notebooks to the cloud, to an Azure Databricks workspace.
But before that, we need to analyse the programming languages used by the respondents of the 2020 Stack Overflow Developer Survey. This column comes in an awkward format, as all the choices of each developer are stored in a single cell, separated by a semicolon (;). …
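As a rough sketch of how such a column can be untangled in pandas, the snippet below splits each cell on the semicolon, gives every language its own row, and counts the occurrences. The column name and the sample rows are assumptions for illustration, and this is not necessarily the approach taken in the notebook.

```python
import pandas as pd


def count_languages(df, column="LanguageWorkedWith", sep=";"):
    """Count how often each value appears in a separator-delimited column."""
    return (
        df[column]
        .dropna()           # respondents who skipped the question
        .str.split(sep)     # "Python;SQL" -> ["Python", "SQL"]
        .explode()          # one row per language
        .str.strip()
        .value_counts()
    )


# Hypothetical sample in the same shape as the survey column.
survey = pd.DataFrame(
    {"LanguageWorkedWith": ["Python;SQL", "Python", None, "C#;Python;SQL"]}
)
language_counts = count_languages(survey)
print(language_counts)
```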
So far in this series we’ve worked with numerical data. Today we’ll analyse the education of respondents to the 2020 Stack Overflow Developer Survey by finding out which are the most frequent education levels.
Thankfully this is a straightforward demo. We just need to map the original options to new values (e.g. “Master’s degree (M.A., M.S., M.Eng., MBA, etc.)” to simply “Master’s degree”), count their frequencies and plot the bar chart.
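Those steps could look roughly like the sketch below. The column name, the second mapping entry and the sample answers are assumptions for illustration; only the “Master’s degree” mapping comes from the text above.

```python
import pandas as pd

# Long survey options mapped to shorter labels.
# Only the first entry is taken from the article; the second is illustrative.
EDUCATION_LABELS = {
    "Master’s degree (M.A., M.S., M.Eng., MBA, etc.)": "Master’s degree",
    "Bachelor’s degree (B.A., B.S., B.Eng., etc.)": "Bachelor’s degree",
}


def education_frequencies(df, column="EdLevel", labels=EDUCATION_LABELS):
    """Shorten the education labels and count how often each one appears."""
    # Values without a mapping are kept unchanged instead of becoming NaN.
    shortened = df[column].map(lambda value: labels.get(value, value))
    return shortened.value_counts()


# Hypothetical sample answers.
answers = pd.DataFrame({"EdLevel": [
    "Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",
    "Bachelor’s degree (B.A., B.S., B.Eng., etc.)",
    "Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",
    "Primary/elementary school",
]})
education_counts = education_frequencies(answers)
print(education_counts)
```

The resulting counts can then be handed to the plotting library of your choice to draw the bar chart.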
As usual, here are some handy links to navigate the contents of this series:
In the first part of this series we went through some exploratory data analysis of ages to filter out the bad data and, at the end, plotted a bar chart with the age frequencies. Today we are working on the annual compensations of the 2020 Stack Overflow Developer Survey results.
This will involve binning the values so that we can plot them in a histogram at the end. For that, we need to create bin labels (to improve the visualization) and the bin intervals. We’ll make plenty of use of the wonderful list comprehension feature of Python!
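Here is a minimal sketch of what building the intervals and labels with list comprehensions might look like. The bin width, the label format and the sample compensations are assumptions for illustration, not the values used in the article.

```python
import pandas as pd


def make_bins(start, stop, step):
    """Build bin edges and matching labels with list comprehensions."""
    edges = [start + i * step for i in range((stop - start) // step + 1)]
    labels = [
        f"{low // 1000}k-{high // 1000}k"
        for low, high in zip(edges[:-1], edges[1:])
    ]
    return edges, labels


edges, labels = make_bins(0, 100_000, 25_000)

# Hypothetical annual compensations.
compensation = pd.Series([10_000, 30_000, 60_000, 99_000, 25_000])

# right=False makes each bin include its lower edge and exclude its upper edge.
binned = pd.cut(compensation, bins=edges, labels=labels, right=False)
print(binned.value_counts(sort=False))
```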
While Plotly can bin…
After posting a handful of separate articles on data analysis with Python, I’ve decided to share some of the work I did on previous personal projects in the form of a proper series.
This “Python Data Analysis” series will consist of five articles tackling different data problems using the 2020 Stack Overflow Developer survey results dataset. I will show you how to use pandas to overcome issues with numeric and categorical data to create nice visualizations with Plotly (Express) at the end.
Although I only show a Python script here, each article has its own Jupyter notebook with the same…
pandas is a wonderful library to work with data in Python. If you’re accustomed to tabular data, you will feel right at home with pandas and, better yet, you get to stay there while writing Python code. I started working with this library a couple of years ago, but I only started using it seriously last year. In this period, I’ve come across many useful functions, so today I will briefly show off five that have stood out to me for their applications.
Sometimes there is a need for a custom sorting order. If you try to use the sort_values function in a column with…
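One common way to get a custom order, sketched here under my own assumptions and not necessarily the approach the article goes on to describe, is to turn the column into an ordered categorical before sorting. The helper name, the column and the order below are made up for illustration.

```python
import pandas as pd


def sort_by_custom_order(df, column, order):
    """Sort df by column following the explicit order given, not alphabetically."""
    out = df.copy()
    # An ordered categorical makes sort_values follow the order of `categories`.
    out[column] = pd.Categorical(out[column], categories=order, ordered=True)
    return out.sort_values(column).reset_index(drop=True)


# Hypothetical data: alphabetical sorting would give L, M, S, XL.
shirts = pd.DataFrame({"size": ["L", "S", "XL", "M"], "stock": [4, 7, 1, 3]})
sorted_shirts = sort_by_custom_order(shirts, "size", ["S", "M", "L", "XL"])
print(sorted_shirts)
```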
I write about data science to help other people who might come across the same problems