Nearly everything can be represented as data, which makes data an essential part of both computing and Machine Learning. A Machine Learning model can only perform as well as the dataset it learns from. But how do you determine which dataset is best for your project? Here’s a list of the top 10 free and easily accessible online Machine Learning datasets.
name: common name
Consists of car seat sales at 400 different store locations, described by 11 variables. Each of the following variables is measured in thousands.
Contains information on almost 54,000 diamonds, described by ten variables.
This dataset accepts contributed recordings of spoken digits, as long as they are 8kHz wav files and in English. The recordings are trimmed at the beginning and end so that they contain minimal silence. As an open dataset, it is expected to grow over time as contributions trickle in. The dataset aims to help with digit pronunciation problems and, at the time of this post, consists of 3,000 recordings from six speakers (50 of each digit per speaker).
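The recording count quoted above follows directly from the other figures. As a quick sanity check, here is a minimal sketch that assumes every speaker records each of the ten digits (0 through 9) the same number of times; the function name is purely illustrative and not part of the dataset.

```python
# Figures taken from the description above. The assumption is that every
# speaker records each of the ten digits the same number of times.
SPEAKERS = 6
DIGITS = 10  # the digits 0 through 9
RECORDINGS_PER_DIGIT_PER_SPEAKER = 50


def expected_recordings(speakers, digits, per_digit):
    """Return the total number of recordings for a balanced collection."""
    return speakers * digits * per_digit


total = expected_recordings(SPEAKERS, DIGITS, RECORDINGS_PER_DIGIT_PER_SPEAKER)
print(total)
```

Because the dataset is open to contributions, rerunning the same check with updated speaker counts is an easy way to spot an incomplete download.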
Wikipedia is not only a resource for students writing research papers, but also a very useful tool for Natural Language Processing researchers. This dataset consists of nearly 1.9 billion words from more than 4 million Wikipedia articles and can be searched by word, phrase, or paragraph.
The subjects of this dataset are mostly male and female adults between the ages of 18 and 20, from various ethnicities. The objective of the dataset is to help distinguish not only between genders but also between emotions. The dataset contains images of the male and female subjects at a resolution of 180×200 pixels. In total, nearly 400 individuals participated, with 20 images taken per subject. Anyone can download the dataset as a zip file.
Ham or spam? This dataset helps predict whether a text message is ham (legitimate) or spam. Consisting of more than 5,500 messages in English, it is beginner-friendly and simple to comprehend. The data is stored in comma-separated value format, with one message per line and two columns: v1, the label (ham or spam), and v2, the raw text.
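To illustrate the two-column layout described above, here is a minimal sketch that reads label and text pairs with Python's standard csv module. The sample rows are hypothetical and written only for this example; they are not taken from the dataset, and the real file may need a specific text encoding when opened.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows in the two-column layout described above
# (v1 = label, v2 = raw text). These are NOT taken from the real dataset.
SAMPLE_CSV = '''v1,v2
ham,"See you at lunch, ok?"
spam,"You have won a prize! Reply now to claim."
ham,Running late
'''


def load_messages(handle):
    """Return a list of (label, text) pairs from a v1/v2 CSV file object."""
    reader = csv.DictReader(handle)
    return [(row["v1"], row["v2"]) for row in reader]


messages = load_messages(io.StringIO(SAMPLE_CSV))

# Count how many messages carry each label.
label_counts = Counter(label for label, _ in messages)
print(label_counts)
```

To read a downloaded copy instead, the same function could be passed a file opened with `open(path, newline="")`, where `path` is wherever the CSV was saved.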
Like the Spam SMS Classifier dataset, this dataset is beginner-friendly and useful for understanding deep learning and pattern recognition techniques on real-world data. With 70,000 grayscale images of 28×28 pixels each, this set was created to replace the original MNIST dataset as the new benchmark for algorithms. Each pixel has a single integer value from 0 to 255 associated with it, with higher numbers representing darker pixels.
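A common first step with pixel data like this is to rescale the 0 to 255 integers to the range 0 to 1 and to arrange each image as a 28×28 grid. The sketch below assumes an image arrives as a flat list of 784 values in row-major order; that storage order is an assumption, and the synthetic pixel values are generated only for illustration.

```python
IMAGE_SIDE = 28   # images are 28 pixels wide and 28 pixels tall
MAX_PIXEL = 255   # largest possible pixel value


def to_grid(flat_pixels, side=IMAGE_SIDE):
    """Split a flat list of side*side pixels into rows (assumes row-major order)."""
    if len(flat_pixels) != side * side:
        raise ValueError("expected %d pixels, got %d" % (side * side, len(flat_pixels)))
    return [flat_pixels[r * side:(r + 1) * side] for r in range(side)]


def normalize(flat_pixels, max_value=MAX_PIXEL):
    """Rescale integer pixel values from 0..max_value to floats in 0..1."""
    return [p / max_value for p in flat_pixels]


# Synthetic image for illustration only: values cycle through 0..255.
flat = [i % 256 for i in range(IMAGE_SIDE * IMAGE_SIDE)]

grid = to_grid(flat)
scaled = normalize(flat)
print(len(grid), len(grid[0]), min(scaled), max(scaled))
```

Rescaling is a convention rather than a requirement, but many learning algorithms behave better when inputs share a small, common range.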
Often used for classification problems in machine learning, this dataset describes the characteristics of the cell nuclei present in an image using the following real-valued features:
Used by the statistician R.A. Fisher in 1936, this beginner-friendly dataset can still be used to build simple machine learning projects. The dataset is small and consists of four attributes, all measured in centimeters (sepal length, sepal width, petal length, and petal width), with three classes: Virginica, Setosa, and Versicolor.
Creating datasets for machine learning is laborious manual work, but luckily there are several public datasets available. The datasets mentioned above are user-friendly, and there are plenty of other accessible datasets available, whatever your project or use case.