Top 10 Datasets in Machine Learning


Everything can be represented by data, which makes data an essential part of both computing and machine learning. How well a machine learning model performs depends heavily on its datasets. But how do you determine which dataset is best for your project? Here's a list of the top 10 free and easily accessible online machine learning datasets.

MSleep Dataset

Describes the sleep habits of mammals with 11 variables.

  1. name: common name
  2. genus: taxonomic rank
  3. vore: carnivore, omnivore, or herbivore
  4. order: taxonomic rank
  5. conservation: conservation status of the mammal
  6. sleep_total: total amount of sleep, measured in hours
  7. sleep_rem: REM sleep, measured in hours
  8. sleep_cycle: length of the sleep cycle, measured in hours
  9. awake: time spent awake, measured in hours
  10. brainwt: brain weight in kilograms
  11. bodywt: body weight in kilograms

Car Seat Dataset

Consists of car seat sales at 400 different store locations, described by 11 variables. Sales, Income, Advertising, and Population are measured in thousands.

  1. Sales: unit sales at each location
  2. CompPrice: Price charged by a competitor at each location
  3. Income: Community income level measured in thousands of dollars
  4. Advertising: Local advertising budget for the company at each location
  5. Population: Population size in region
  6. Price: Price the company charges for car seats at each site
  7. ShelveLoc: Bad, Good, or Medium, indicating the quality of the shelving location for the car seats at each site
  8. Age: Average age of the local population
  9. Education: Education level at each location
  10. Urban: Yes/No to indicate whether the store is in an urban or rural location
  11. US: Yes/No to indicate if the store is in the US or not

Diamond Dataset

Contains information regarding almost 54,000 diamonds with ten variables.

  1. Carat: the weight of the diamond
  2. Cut: quality of the cut, rated Fair, Good, Very Good, Premium, or Ideal
  3. Color: color of the diamond, graded from D (best) to J (worst)
  4. Clarity: how clear the diamond is, measured on the following scale (worst to best): I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF
  5. Depth: total depth percentage, calculated using the x, y, and z variables
  6. Table: width of the top of the diamond in relation to its widest point
  7. Price: amount in USD
  8. X: length in millimeters
  9. Y: width in millimeters
  10. Z: depth in millimeters
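
The depth percentage in the list above can be reproduced from the three measurements. The sketch below assumes the commonly documented definition for this dataset, where z is divided by the mean of x and y; the function name and the sample measurements are illustrative and not taken from a specific row.

```python
def depth_percentage(x: float, y: float, z: float) -> float:
    """Total depth percentage: z divided by the mean of x and y, times 100.

    Assumes x, y, and z are all given in millimeters.
    """
    mean_xy = (x + y) / 2
    if mean_xy == 0:
        raise ValueError("x and y must not both be zero")
    return 100 * z / mean_xy


# Illustrative measurements in millimeters (hypothetical diamond).
print(depth_percentage(4.0, 4.0, 2.5))  # prints 62.5
```

Recorded depth values in the dataset may differ slightly from the recomputed ones because of rounding in the measurements.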

Free Spoken Digit Dataset

You can contribute your own recordings of spoken digits to this dataset, as long as they are 8 kHz WAV files and in English. Recordings are trimmed at the beginning and end so that they contain minimal silence. As an open dataset, it is expected to grow over time as contributions trickle in. The dataset is meant to help with digit pronunciation problems and, at the time of this post, consists of six speakers and 3,000 recordings (50 of each digit per speaker).
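
The recording count follows directly from the numbers above: speakers times ten digits times 50 recordings each. The sketch below checks that arithmetic and parses a file name into its parts. The `{digit}_{speaker}_{index}.wav` naming pattern and the speaker name `speaker1` are assumptions made for illustration, so check the dataset's own README before relying on them.

```python
from typing import NamedTuple


class Recording(NamedTuple):
    digit: int
    speaker: str
    index: int


def parse_recording_name(filename: str) -> Recording:
    """Split a name like '7_speaker1_32.wav' into digit, speaker, and index.

    The naming pattern is an assumption, not confirmed by this article.
    """
    stem, _, extension = filename.rpartition(".")
    if extension.lower() != "wav":
        raise ValueError(f"not a wav file: {filename!r}")
    digit, speaker, index = stem.split("_")
    return Recording(int(digit), speaker, int(index))


def expected_total(speakers: int, digits: int = 10, per_digit: int = 50) -> int:
    """Number of recordings if every speaker records every digit per_digit times."""
    return speakers * digits * per_digit


print(parse_recording_name("7_speaker1_32.wav"))
print(expected_total(6))  # prints 3000
```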

The Wikipedia Corpus

Wikipedia is not only a resource for students writing research papers, but also a very useful tool for Natural Language Processing researchers. This dataset consists of nearly 1.9 billion words from more than 4 million Wikipedia articles and can be searched by word, phrase, or paragraph.

Face Image Dataset

Subjects in this dataset are mostly male and female adults between 18 and 20 years old, from various ethnicities. The objective of the dataset is to help distinguish not only between genders but also between emotions. Images of the male and female subjects were taken at a resolution of 180×200 pixels. In total, nearly 400 individuals participated, with 20 images taken of each subject. Anyone can download the dataset as a zip file.

Spam SMS Classifier

Ham or spam? This dataset helps predict whether a text message is ham (legitimate) or spam. Consisting of more than 5,500 messages in English, it is beginner-friendly and simple to comprehend. The data comes in comma-separated value format with one message per line and two columns: v1, the label (ham or spam), and v2, the raw text.
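
Reading the two-column layout takes only the standard library. The sketch below is a minimal example under the assumption that the file has a header row naming the columns v1 and v2; the two sample messages are invented for illustration and are not rows from the dataset.

```python
import csv
import io

# Invented sample in the described layout: v1 is the label, v2 is the raw text.
SAMPLE = 'v1,v2\nham,"See you at lunch, ok?"\nspam,WIN a FREE prize now\n'


def load_messages(handle):
    """Return a list of (label, text) pairs from a CSV with v1 and v2 columns."""
    reader = csv.DictReader(handle)
    return [(row["v1"], row["v2"]) for row in reader]


messages = load_messages(io.StringIO(SAMPLE))
spam_count = sum(1 for label, _ in messages if label == "spam")
print(len(messages), spam_count)  # prints 2 1
```

For a real file, pass an open file handle instead of the in-memory sample. Quoted fields, such as the first sample message, keep their embedded commas intact.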

Fashion MNIST Dataset

Like the Spam SMS Classifier dataset, this dataset is beginner-friendly and useful for understanding deep learning pattern recognition techniques on real-world data. With 70,000 grayscale images of 28×28 pixels, this set was created to replace the original MNIST dataset as the new benchmark for algorithms. Each pixel has an associated integer value from 0 to 255, with higher values indicating darker pixels.
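
A common first step with 0 to 255 pixel values is to scale them into the range 0.0 to 1.0 before training. The sketch below assumes each image arrives as a flat list of 784 integers (28×28); the function names are illustrative, and the all-zero image stands in for real data.

```python
def normalize_pixels(pixels):
    """Scale integer pixel values in 0..255 to floats in 0.0..1.0."""
    for value in pixels:
        if not 0 <= value <= 255:
            raise ValueError(f"pixel value out of range: {value}")
    return [value / 255 for value in pixels]


def to_rows(flat, width=28):
    """Reshape a flat pixel list into rows of the given width."""
    if len(flat) % width != 0:
        raise ValueError("pixel count is not a multiple of the row width")
    return [flat[i:i + width] for i in range(0, len(flat), width)]


# Placeholder image: 784 zero-valued pixels, not real dataset content.
blank = [0] * (28 * 28)
image = to_rows(normalize_pixels(blank))
print(len(image), len(image[0]))  # prints 28 28
```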

Breast Cancer Wisconsin (Diagnostic) Dataset

Often used for classification problems in machine learning, this dataset describes the characteristics of the cell nuclei present in each image with the following real-valued features:

  1. Radius 
  2. Texture (standard deviation of gray-scale values)
  3. Perimeter 
  4. Area
  5. Smoothness 
  6. Compactness (perimeter^2 / area - 1.0)
  7. Concavity
  8. Concave points
  9. Symmetry
  10. Fractal dimension 
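
The compactness formula in the list above is easy to try on simple shapes. The sketch below is a minimal illustration: the function name is hypothetical, and the circle and square are test shapes rather than values from the dataset.

```python
import math


def compactness(perimeter: float, area: float) -> float:
    """Compactness as listed above: perimeter^2 / area - 1.0."""
    if area <= 0:
        raise ValueError("area must be positive")
    return perimeter ** 2 / area - 1.0


# A circle of radius 3: perimeter 2*pi*r, area pi*r^2, so the result is 4*pi - 1.
radius = 3.0
circle = compactness(2 * math.pi * radius, math.pi * radius ** 2)
print(round(circle, 3))  # prints 11.566

# A square with side 2: perimeter 8, area 4.
print(compactness(8.0, 4.0))  # prints 15.0
```

A circle gives the smallest possible value for this measure, so more irregular outlines score higher.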

Iris Flower Dataset

Used by the statistician R.A. Fisher in 1936, this dataset is beginner-friendly and can still be used to build simple machine learning projects. The dataset is small and consists of four attributes, all measured in centimeters (sepal length, sepal width, petal length, and petal width), and three classes: Virginica, Setosa, and Versicolor.

Creating datasets for machine learning is laborious manual work, but luckily several public datasets are available. The datasets mentioned above are user-friendly, and there are plenty of other accessible datasets to choose from, whatever your project or use case.