These projects test more intermediate classification skills, like using convolutional neural networks (CNNs). Any of these datasets and project ideas is a good fit for those who have experience working with machine learning.
As a company, we have to buy our books ahead of time. We took out a loan last month to buy our original batch of books. The value of the loan was the total cost of all the books that we bought.
We made some money back through customers buying our books last month. Next month, we know which books we will be sending to which customers, but we do not know who will buy what books.
The question for you to answer is: Will we be able to both pay back our loan and afford our next book purchase order?
You should create some sort of machine learning model for answering this take-home (as opposed to simply looking at the average conversion rate or something like that).
However, we do not expect you to build models from scratch. NumPy, SciPy, scikit-learn, and everything else are all fair game.
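One way to frame the question is as a purchase-probability model feeding a revenue estimate. The sketch below is only an illustration under stated assumptions: the data is synthetic, the two features, the prices, and the amounts `loan_balance` and `next_order_cost` are all made up, and logistic regression is just one reasonable model choice.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical data: one row per (customer, book) shipment from last month.
# The two features and the purchase rule are invented for illustration.
n_last = 2000
X_last = rng.normal(size=(n_last, 2))
true_logit = 0.8 * X_last[:, 0] - 0.5 * X_last[:, 1] - 0.3
bought_last = rng.random(n_last) < 1 / (1 + np.exp(-true_logit))

# Fit a purchase-probability model on last month's outcomes.
model = LogisticRegression().fit(X_last, bought_last)

# Next month's planned shipments: features and prices known, purchases not.
n_next = 2500
X_next = rng.normal(size=(n_next, 2))
price_next = rng.uniform(8.0, 25.0, size=n_next)

p_buy = model.predict_proba(X_next)[:, 1]
expected_revenue = float(np.sum(p_buy * price_next))

# Simulate purchase outcomes to get a spread around the expectation.
draws = rng.random((1000, n_next)) < p_buy
sims = draws.astype(float) @ price_next
low, high = np.percentile(sims, [5, 95])

loan_balance = 12000.0    # assumed amount still owed on the loan
next_order_cost = 6000.0  # assumed cost of the next purchase order
needed = loan_balance + next_order_cost
prob_enough = float(np.mean(sims >= needed))

print(f"expected revenue: {expected_revenue:.0f} (90% range {low:.0f} to {high:.0f})")
print(f"estimated chance of covering {needed:.0f}: {prob_enough:.2f}")
```

Reporting a range and a probability, rather than a single yes or no, makes it easier to see how sensitive the answer is to the model's uncertainty.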
7. Predicting Breast Cancer with Deep Learning
Health informatics is a fast-growing field in data science, and there’s a wide range of applications of machine learning in healthcare. This Python project uses the IDC (Invasive Ductal Carcinoma) dataset and asks you to build a model to predict IDC breast cancer. You could also work on a similar project using the UCI Breast Cancer dataset.
How to Do the Project: This tutorial walks you through using Python, along with the Keras library, to build a convolutional neural network.
You might also want to check out the ImageNet dataset, a great source for CNN projects, or this tutorial for building a CNN with Python.
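If you start with the tabular alternative rather than the IDC images, a simple baseline can be useful before moving on to a CNN. The sketch below assumes that scikit-learn's built-in `load_breast_cancer` data (the Breast Cancer Wisconsin Diagnostic set) is an acceptable stand-in for the UCI dataset mentioned above; it may not be the exact file the project links to, and logistic regression is a baseline choice, not the tutorial's method.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 569 samples with 30 numeric features; target 0 = malignant, 1 = benign.
X, y = load_breast_cancer(return_X_y=True)

# Hold out a stratified test set so both classes appear in each split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scale the features, then fit a linear classifier as a baseline.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)
print(f"test accuracy: {accuracy:.3f}")
print(classification_report(
    y_test, clf.predict(X_test), target_names=["malignant", "benign"]
))
```

For a medical task, the per-class recall in the report is usually more informative than overall accuracy, since missing a malignant case is costlier than a false alarm.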
8. Conversion Rate Modeling
One use case for classification is building prediction models, specifically related to marketing and conversions. The challenge for projects like these is finding reliable data sources.
One option is this Clicks and Conversion Tracking dataset on Kaggle, which features the social media marketing performance of an anonymous brand. If you’re looking for another source, check out this conversion rate dataset on GitHub.
How to Do the Project: There are numerous models you can create to predict conversions, but here’s a helpful tutorial that examines using decision trees to predict conversion rate.
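As a rough illustration of the decision tree approach, the sketch below trains a shallow tree on synthetic visitor data. The feature names `pages_viewed` and `is_returning`, and the rule that generates conversions, are hypothetical and are not taken from either dataset above.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(42)
n = 1000

# Hypothetical features for each visitor.
pages_viewed = rng.integers(1, 20, size=n)
is_returning = rng.integers(0, 2, size=n)

# Synthetic rule: returning visitors who browse more convert more often.
p_convert = np.clip(0.05 + 0.03 * pages_viewed * is_returning, 0, 0.9)
converted = rng.random(n) < p_convert

X = np.column_stack([pages_viewed, is_returning])
X_train, X_test, y_train, y_test = train_test_split(
    X, converted, test_size=0.3, random_state=0
)

# A shallow tree with a minimum leaf size keeps the rules readable.
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=20, random_state=0)
tree.fit(X_train, y_train)

test_accuracy = tree.score(X_test, y_test)
predicted_rate = tree.predict_proba(X_test)[:, 1].mean()

print(export_text(tree, feature_names=["pages_viewed", "is_returning"]))
print(f"accuracy: {test_accuracy:.3f}, predicted conversion rate: {predicted_rate:.3f}")
```

Averaging the predicted probabilities gives an estimated conversion rate for a group of visitors, which is often what a marketing team wants rather than a per-visitor label.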
9. Music Genre Classification Project
Building genre classification models will allow you to practice intermediate Python techniques, including K-nearest neighbor and random forest algorithms, as well as the Librosa library.
There are numerous datasets you can use. While the Million Song Dataset is one of the best, there are also music datasets on data.world.
How to Do the Project: Here’s a helpful tutorial that looks at using content-based filters for music genre classification.
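To show how K-nearest neighbor and random forest classifiers can be compared on audio features, here is a minimal sketch. It makes several assumptions: the clips are synthetic noisy tones, the two "genres" are invented, and the two features are computed by hand with NumPy as a stand-in for what you would normally extract with Librosa (for example, MFCCs).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

SR = 8000  # sample rate in Hz (assumed)
rng = np.random.default_rng(1)

def synth_clip(base_freq):
    """One second of a noisy tone; a stand-in for a real audio clip."""
    t = np.arange(SR) / SR
    freq = base_freq * rng.uniform(0.8, 1.2)
    return np.sin(2 * np.pi * freq * t) + 0.2 * rng.normal(size=SR)

def features(clip):
    """Two hand-rolled features: zero-crossing rate and spectral centroid."""
    zcr = np.mean(np.abs(np.diff(np.sign(clip))) > 0)
    spectrum = np.abs(np.fft.rfft(clip))
    freqs = np.fft.rfftfreq(len(clip), d=1 / SR)
    centroid = np.sum(freqs * spectrum) / np.sum(spectrum)
    return [zcr, centroid]

# Two made-up "genres": 0 = low-pitched clips, 1 = high-pitched clips.
labels = np.array([0] * 60 + [1] * 60)
X = np.array([features(synth_clip(200 if g == 0 else 900)) for g in labels])

models = {
    # KNN is distance-based, so the features are scaled first.
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
scores = {
    name: cross_val_score(model, X, labels, cv=5).mean()
    for name, model in models.items()
}
print(scores)
```

With real music, the same comparison loop applies; only the feature extraction step and the labels change.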
10. Speech Emotion Recognition
The RAVDESS dataset features 7,000+ files, in which actors express various emotions while speaking. In terms of building speech emotion recognition models, this dataset is one of the most comprehensive out there.
You might also want to check out data sources like LSSED (a large-scale dataset and benchmark for speech emotion recognition) or see this list of emotion recognition datasets.
How to Do the Project: This tutorial walks you through using a convolutional neural network to examine RAVDESS data.
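Before any model can be trained, each audio file needs an emotion label. RAVDESS encodes metadata in the filename as seven hyphen-separated two-digit fields, with the third field giving the emotion and the seventh the actor. The helper below is a small sketch based on that convention; the function name `parse_ravdess_name` is hypothetical, and the code table is worth checking against the dataset's own documentation.

```python
from pathlib import Path

# Emotion codes from the third field of a RAVDESS filename.
EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

def parse_ravdess_name(path):
    """Return (emotion, actor_id) parsed from a RAVDESS-style filename."""
    parts = Path(path).stem.split("-")
    if len(parts) != 7:
        raise ValueError(f"unexpected filename layout: {path}")
    emotion = EMOTIONS[parts[2]]  # third field is the emotion code
    actor_id = int(parts[6])      # seventh field is the actor number
    return emotion, actor_id

example = parse_ravdess_name("Actor_12/03-01-06-01-02-01-12.wav")
print(example)  # ('fearful', 12)
```

Keeping the actor number is useful for splitting the data by actor, so that the same voice does not appear in both the training and test sets.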
11. News Article Categorization
With the increasing volume of news articles available on the internet, classifying them into different categories can be helpful in organizing and filtering content for users. In this project, you’ll build a machine learning model to classify news articles into various categories, such as politics, technology, sports, and entertainment.
You can start by using the BBC News Classification Dataset, which contains over 2,000 news articles categorized into five classes: business, entertainment, politics, sports, and tech. You can experiment with various classification algorithms, such as Naive Bayes, k-Nearest Neighbors, and Support Vector Machines.
How to Do the Project: Follow this tutorial on Analytics Vidhya that demonstrates how to perform text classification using various machine learning algorithms, such as Naive Bayes, Logistic Regression, and Support Vector Machines.
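A common starting point for text classification is a TF-IDF representation fed into Naive Bayes. The sketch below assumes a tiny set of made-up headlines covering three of the five categories; the real dataset contains full articles, so treat this only as an outline of the pipeline, not as the tutorial's code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up headlines for illustration only.
texts = [
    "Stock markets rally as bank profits rise",
    "Central bank raises interest rates again",
    "Company shares fall after profit warning",
    "Striker scores twice in cup final win",
    "Coach praises team after league match victory",
    "Tennis champion wins title in straight sets",
    "New smartphone launches with faster chip",
    "Software update fixes browser security flaw",
    "Tech firm unveils laptop with new processor chip",
]
labels = ["business"] * 3 + ["sports"] * 3 + ["tech"] * 3

# TF-IDF turns each text into a weighted word-count vector;
# Multinomial Naive Bayes then learns per-class word distributions.
model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
model.fit(texts, labels)

predictions = model.predict([
    "Bank shares rise on strong profits",
    "Team wins the cup final match",
])
print(list(predictions))  # ['business', 'sports']
```

Because the vectorizer and classifier sit in one pipeline, swapping `MultinomialNB` for another scikit-learn classifier is a one-line change, which makes it easy to compare the algorithms listed above.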
12. German Credit Data Analysis
The German Credit dataset provides insights into the factors that financial institutions consider when determining the creditworthiness of an applicant. Featuring a mix of numerical and categorical attributes, this dataset presents opportunities for various forms of data analysis, machine learning, and prediction modeling. By understanding the correlations and patterns within this data, one can develop predictive models to determine the likelihood of an approval based on an applicant’s credit, or even spot potential biases in the credit decision-making process.
The dataset has been sourced from Professor Dr. Hans Hofmann of the Universität Hamburg. It comprises 1,000 instances with attributes capturing an applicant’s financial behavior, history, and personal details. For instance, it includes attributes such as the status of the applicant’s checking account, credit history, purpose for the loan, and personal information like age and job type. Two versions of the dataset are provided: the original dataset (german.data), which contains a mix of numerical and categorical attributes, and a modified dataset (german.data-numeric), formatted for algorithms that prefer numerical input, wherein categorical variables have been transformed into numerical indicators.
How to Do the Project: Download this dataset from GitHub and identify the features in the dataset using this .names file. Dive into the German Credit dataset, assess the patterns, and build a predictive model for credit approval. Begin with german.data to grasp the categorical essence of the data and consider using german.data-numeric for algorithmic requirements.
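The step from categorical attributes to numerical indicators can be sketched with one-hot encoding. The example below uses a handful of made-up rows with only four hypothetical columns (checking account status, credit history, loan duration, age) and an assumed label coding of 1 = good and 2 = bad; consult the .names file for the real column list and coding. It is not the procedure used to produce german.data-numeric.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

# Made-up rows: two categorical codes followed by two numeric values.
rows = [
    ["A11", "A34", 6, 61],
    ["A12", "A32", 48, 23],
    ["A14", "A34", 12, 47],
    ["A11", "A32", 40, 44],
    ["A11", "A33", 24, 52],
    ["A14", "A32", 30, 36],
    ["A14", "A32", 18, 55],
    ["A12", "A32", 36, 29],
]
credit_label = np.array([1, 2, 1, 1, 2, 1, 1, 2])  # assumed: 1 = good, 2 = bad

categorical = np.array([[r[0], r[1]] for r in rows])
numeric = np.array([[r[2], r[3]] for r in rows], dtype=float)

# One indicator column per distinct category value.
encoder = OneHotEncoder(handle_unknown="ignore")
indicators = encoder.fit_transform(categorical).toarray()

# Final design matrix: indicator columns followed by the numeric columns.
X = np.hstack([indicators, numeric])
print(encoder.categories_)
print(X.shape)

model = LogisticRegression(max_iter=1000).fit(X, credit_label)
print(model.predict(X[:3]))
```

With the full dataset, the same encoder can be fitted on the training split only and reused on the test split, so that category handling stays consistent between the two.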