

In ODDS, we openly provide access to a large collection of outlier detection datasets with ground truth (if available). Our focus is to provide datasets from different domains and present them under a single umbrella for the research community. The datasets are grouped by type into separate tables, in the order listed below.
Multi-dimensional point datasets: There is one record per data point, and each record contains several attributes.
Time series graph datasets for event detection: Temporal graph data where the graph changes dynamically over time as new nodes and edges arrive or existing nodes and edges disappear.
Time series point datasets (Multivariate/Univariate): Temporal point data where each point has one or more attributes and the attributes change over time.
Adversarial/Attack scenario and security datasets: Opinion fraud detection data from online review systems, and cyber security data, e.g. intrusion detection under attack scenarios such as DoS and DDoS.
Crowded scene video data for anomaly detection: Video clips acquired with a camera.
Multi-dimensional point datasets
Dataset | #points | #dim. | #outliers (%) |
---|---|---|---|
Lympho | 148 | 18 | 6 (4.1%) |
WBC | 378 | 30 | 21 (5.6%) |
Glass | 214 | 9 | 9 (4.2%) |
Vowels | 1456 | 12 | 50 (3.4%) |
Cardio | 1831 | 21 | 176 (9.6%) |
Thyroid | 3772 | 6 | 93 (2.5%) |
Musk | 3062 | 166 | 97 (3.2%) |
Satimage-2 | 5803 | 36 | 71 (1.2%) |
Letter Recognition | 1600 | 32 | 100 (6.25%) |
Speech | 3686 | 400 | 61 (1.65%) |
Pima | 768 | 8 | 268 (35%) |
Satellite | 6435 | 36 | 2036 (32%) |
Shuttle | 49097 | 9 | 3511 (7%) |
BreastW | 683 | 9 | 239 (35%) |
Arrhythmia | 452 | 274 | 66 (15%) |
Ionosphere | 351 | 33 | 126 (36%) |
Mnist | 7603 | 100 | 700 (9.2%) |
Optdigits | 5216 | 64 | 150 (3%) |
Http (KDDCUP99) | 567479 | 3 | 2211 (0.4%) |
ForestCover | 286048 | 10 | 2747 (0.9%) |
Mulcross | 262144 | 4 | 26214 (10%) |
Smtp (KDDCUP99) | 95156 | 3 | 30 (0.03%) |
Mammography | 11183 | 6 | 260 (2.32%) |
Annthyroid | 7200 | 6 | 534 (7.42%) |
Pendigits | 6870 | 16 | 156 (2.27%) |
Ecoli | 336 | 7 | 9 (2.6%) |
Wine | 129 | 13 | 10 (7.7%) |
Vertebral | 240 | 6 | 30 (12.5%) |
Yeast | 1364 | 8 | 64 (4.7%) |
Seismic | 2584 | 11 | 170 (6.5%) |
Heart | 224 | 44 | 10 (4.4%) |
OSAD Benchmark Datasets | Multiple datasets | — | — |
One-class dataset by David Tax | Multiple datasets | — | — |
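The percentage in the last column of the table above appears to be the outlier count divided by the number of points. The following is a minimal sketch of that arithmetic for three rows copied from the table; the helper name `outlier_rate` is hypothetical and not part of ODDS.

```python
# Rows copied from the table above: (dataset, #points, #outliers).
rows = [
    ("Lympho", 148, 6),
    ("Glass", 214, 9),
    ("Cardio", 1831, 176),
]


def outlier_rate(n_points: int, n_outliers: int) -> float:
    """Return the share of outliers as a percentage of all points.

    Hypothetical helper for illustration only.
    """
    if n_points <= 0:
        raise ValueError("n_points must be positive")
    return 100.0 * n_outliers / n_points


# Round to one decimal place, as most rows of the table do.
rates = {name: round(outlier_rate(n, k), 1) for name, n, k in rows}

for name, rate in rates.items():
    print(f"{name}: {rate}%")
```

The same check can be run on any other row to spot a mismatch between the counts and the reported percentage.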
Time series graph datasets for event detection
Dataset | #nodes | Duration | Description |
---|---|---|---|
EnronInc | 80,884 | 4 years | Email communication network over time in Enron Inc. |
RealityMining | 9104 | 50 weeks | Communication and proximity data of 97 faculty, students, and staff at MIT. |
TwitterWorldCup2014 | 54K | 1 month | Entity co-mention network from Twitter related to the 2014 Soccer World Cup. |
TwitterSecurity2014 | 130K | 4 months | Entity co-mention network from Twitter related to terrorism and domestic security. |
NYTNews | 320K | 7.5 years | Entity co-mention graph for New York Times News Corpus over 7.5 years. |
ChallengeNetwork | 125 | 9 days | Simulated cyber challenge network traffic flow data. |
VAST2012MC2 | 5K | 2 days | Bank of Money Regional Office Network Operations Forensics. |
VAST2013MC3 | 1.2K | 2 weeks | Big Marketing computer network flow data. |
VAST2014 | — | 3 days | Timestamped text, network, and transaction data from GAStech. |
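The datasets in this table are temporal graphs, in which nodes and edges arrive or disappear over time. As a rough illustration, and assuming the graph is stored as a sequence of per-interval edge sets (an assumption; the individual datasets may use other formats), the arrivals and disappearances between consecutive snapshots can be computed with set differences. The toy snapshots and all names below are hypothetical.

```python
from typing import Dict, FrozenSet, List, Set

Edge = FrozenSet[str]  # undirected edge stored as an unordered pair of node ids


def edge(u: str, v: str) -> Edge:
    """Build an undirected edge between nodes u and v."""
    return frozenset((u, v))


# Toy data: one edge set per time step (not taken from any listed dataset).
snapshots: List[Set[Edge]] = [
    {edge("a", "b"), edge("b", "c")},
    {edge("a", "b"), edge("c", "d")},
    {edge("c", "d"), edge("d", "e")},
]


def nodes_of(edges: Set[Edge]) -> Set[str]:
    """Collect every node that appears in at least one edge."""
    return {node for e in edges for node in e}


def diff(prev: Set[Edge], curr: Set[Edge]) -> Dict[str, set]:
    """Report what arrived and what disappeared between two snapshots."""
    return {
        "new_edges": curr - prev,
        "gone_edges": prev - curr,
        "new_nodes": nodes_of(curr) - nodes_of(prev),
        "gone_nodes": nodes_of(prev) - nodes_of(curr),
    }


# One change record per pair of consecutive snapshots.
changes = [diff(p, c) for p, c in zip(snapshots, snapshots[1:])]

for step, change in enumerate(changes, start=1):
    print(step, {key: len(value) for key, value in change.items()})
```

Event detection methods typically look for time steps where such changes are unusually large or unusually structured; this sketch only produces the raw differences.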
Time series point datasets (Multivariate/Univariate)
Dataset | Type | Size | Duration | Description |
---|---|---|---|---|
DataMarket – TSDL | Univariate | Multiple datasets | — | The Time Series Data Library (TSDL) was created by Rob Hyndman, Professor of Statistics at Monash University, Australia. |
Yahoo – a benchmark dataset for TSAD | Multivariate | 367 time series | between 741 and 1680 observations per series at regular intervals | This dataset is released by Yahoo Labs to detect unusual traffic on Yahoo servers. |
Numenta Anomaly Benchmark (NAB) | Multivariate | Multiple datasets | — | A benchmark for streaming anomaly detection that uses sensor-provided time-series data. |
Adversarial/Attack scenario and security datasets
Dataset | Size | Description |
---|---|---|
YelpCHI | 67,395 hotel and restaurant reviews | Reviews from Yelp.com for Chicago Hotels and Restaurants. |
YelpNYC | 359,052 restaurant reviews | Reviews from Yelp.com for NYC restaurants |
YelpZip | 608,598 restaurant reviews | Reviews from Yelp.com, organized by zip code, for NY, NJ, VT, CT, and PA. |
YelpAcademic | 2.7M Yelp reviews | Reviews of various businesses from Yelp.com for academic challenge. |
AmazonReview | 34,686,770 product reviews | Reviews from Amazon.com |
SWMReview | 1,132,373 reviews | SWM Review dataset contains reviews under the entertainment category from a popular online software marketplace. |
BeerAdvocate | 1,586,259 beer reviews | Beer reviews from BeerAdvocate |
RateBeer | 2,924,127 beer reviews | Beer reviews from RateBeer |
CellarTracker | 2,025,995 wine reviews | Wine reviews from CellarTracker |
FineFoods | 568,454 food reviews | Food reviews from Amazon |
Movies | 7,911,684 movie reviews | Movie reviews from Amazon |
AZSecure-data | Multiple datasets | Data Science Testbed for Security Researchers |
CAIDA datasets | Multiple datasets | Collection and sharing site of data for scientific analysis of Internet traffic, topology, routing, performance, and security-related events. |
DARPA intrusion detection | Multiple datasets | The Cyber Systems and Technology Group of MIT Lincoln Laboratory, under DARPA ITO and AFRL/SNHS sponsorship, has collected and distributed the first standard corpora of intrusion detection datasets. |
KDDCUP99 | 4,900K connection records | The dataset includes a wide variety of intrusions simulated in a military network environment. |
MAWI Working Group Traffic Archive | 2006 – present collection | This is a traffic data repository maintained by the MAWI Working Group of the WIDE Project, where traffic traces are collected at some sampling points every day. |
MOME | Multiple datasets | Cluster of European Projects aimed at Monitoring and Measurement. |
Waikato Internet Traffic Storage | Multiple datasets | The Waikato Internet Traffic Storage project aims to collect and document all the Internet traces that the WAND Group has in their possession. |
RIPE | Multiple datasets (currently ~100TB) | The RIPE Data Repository is a collection of diverse datasets that are useful for scientific and operational Internet research. |
The Internet Traffic Archive | Multiple datasets | The Internet Traffic Archive is a moderated repository to support widespread access to traces of Internet network traffic, sponsored by ACM SIGCOMM. |
UMassTraceRepository | Multiple datasets | The UMass Trace Repository provides network, storage, and other traces to the research community for analysis. |
Crowded scene video data for anomaly detection
Dataset | Size | Description |
---|---|---|
UCSD Anomaly Detection Dataset | 98 video clips | The UCSD anomaly detection annotated dataset was acquired with a stationary camera mounted at an elevation, overlooking pedestrian walkways. |
University of Minnesota crowd activity datasets | Multiple datasets | Data for monitoring human activity by University of Minnesota. |
Anomalous Behavior Data Set | Multiple datasets | Datasets for anomalous behavior detection in videos. |
Virat video dataset | ~8.5 hours of videos | Video surveillance data for human activity/event detection. |
McGill University Dominant and Rare Event Detection Data | 3 video clips (43, 96 mins) | Video surveillance data for dominant and rare event detection, captured by cameras in a subway station. |