-
Anacode Chinese Web Datastore: a collection of crawled Chinese news and blogs in JSON format.
-
AssetMacro, historical data of Macroeconomic Indicators and Market Data.
-
Awesome Public Datasets on GitHub, curated by caesar0301.
-
AWS (Amazon Web Services) Public Data Sets, provides a centralized repository of public data sets that can be seamlessly integrated into AWS cloud-based applications.
-
BigML big list of public data sources.
-
Bioassay data, described in Virtual screening of bioassay data, by Amanda Schierz, J. of Cheminformatics, with 21 Bioassay datasets (Active / Inactive compounds) available for download.
-
Bitly 1.usa.gov data, anonymized clicks on gov links.
-
Canada Open Data, pilot project with many government and geospatial datasets.
-
Causality Workbench data repository.
-
Corral Big Data repository at Texas Advanced Computing Center, supporting data-centric science.
-
Credit Risk Analytics Data: a home equity loans credit data set, mortgage loan level data set, Loss Given Default (LGD) data set and corporate ratings data set.
-
CrowdFlower Data for Everyone library.
-
Data Source Handbook, A Guide to Public Data, by Pete Warden, O’Reilly (Jan 2011).
-
Datacatalogs.org, open government data from US, EU, Canada, CKAN, and more.
-
Data.gov.uk, publicly available data from the UK (also London Datastore).
-
Data.gov/Education, central guide for education data resources including high-value data sets, data visualization tools, resources for the classroom, applications created from open data and more.
-
DataMarket, visualize the world’s economy, societies, nature, and industries, with 100 million time series from UN, World Bank, Eurostat and other important data providers.
-
Datamob, public data put to good use.
-
Data Planet, the largest repository of standardized and structured statistical data, with over 25 billion data points, 4.3 billion datasets, 400+ source databases.
-
Datasets.co, datasets for data geeks, find and share Machine Learning datasets.
-
DataSF.org, a clearinghouse of datasets available from the City & County of San Francisco, CA.
-
DataFerrett, a data mining tool that accesses and manipulates TheDataWeb, a collection of many on-line US Government datasets.
-
Delve, Data for Evaluating Learning in Valid Experiments
-
EconData, thousands of economic time series, produced by a number of US Government agencies.
-
data.world, discover and share cool data, connect with interesting people, and work together to solve problems faster.
-
Enron Email Dataset, data from about 150 users, mostly senior management of Enron.
-
Europeana Data, contains open metadata on 20 million texts, images, videos and sounds gathered by Europeana – the trusted and comprehensive resource for European cultural heritage content.
-
FEDSTATS, a comprehensive source of US statistics and more
-
FIMI repository for frequent itemset mining, implementations and datasets.
-
Financial Data Finder at OSU, a large catalog of financial data sets.
-
GDELT: The Global Database of Events, Language and Tone, described by the Guardian as “a big data history of life, the universe and everything.”
-
GEO (Gene Expression Omnibus), a gene expression/molecular abundance repository supporting MIAME compliant data submissions, and a curated, online resource for gene expression data browsing, query and retrieval.
-
GeoDa Center, geographical and spatial data.
-
Google ngrams datasets, text from millions of books scanned by Google.
-
Grain Market Research, financial data including stocks, futures, etc.
-
Hilary Mason research-quality Big Data sets collection – many text and image datasets.
-
HitCompanies Datasets, comprehensive data on 10,000 random UK companies sampled from HitCompanies, updated automatically using AI/Machine Learning.
-
ICWSM-2009 dataset contains 44 million blog posts made between August 1st and October 1st, 2008.
-
Infochimps, an open catalog and marketplace for data. You can share, sell, curate, and download data about anything and everything.
-
Investor Links, includes financial data.
-
JMP Public featured datasets
-
Kaggle Datasets.
-
KDD Cup center, with all data, tasks, and results.
-
Kevin Chai list of datasets, for text, SNA, and other fields.
-
KONECT, the Koblenz Network Collection, with large network datasets of all types in order to perform research in the area of network mining.
-
Linking Open Data project, aimed at making data freely available to everyone.
-
Lyst Fashion Data Trends, tracking 10 million global fashion searches a month, easily and freely accessible to academics as a valuable resource.
-
Million Song Dataset
-
MIT Cancer Genomics gene expression datasets and publications, from MIT Whitehead Center for Genome Research.
-
ML Data, the data repository of the EU Pascal2 networks.
-
NASDAQ Data Store, provides access to market data.
-
National Government Statistical Web Sites, data, reports, statistical yearbooks, press releases, and more from about 70 web sites, including countries from Africa, Europe, Asia, and Latin America.
-
National Space Science Data Center (NSSDC), NASA data sets from planetary exploration, space and solar physics, life sciences, astrophysics, and more.
-
NetworkRepository: Interactive Data Repository, has many collections of graph and networks from social science, machine learning, scientific computing, and other areas.
-
Open Data Census, assesses the state of open data around the world.
-
OpenData from Socrata, access to over 10,000 datasets including business, education, government, and fun.
-
Open Source Sports, many sports databases, including Baseball, Football, Basketball, and Hockey.
-
Peter Skomoroch dataset Bookmarks
-
PubGene(TM) Gene Database and Tools, genomic-related publications database
-
Quandl, a collaboratively curated portal to millions of financial and economic time-series datasets.
-
qunb, a platform to find and visualize quantitative data.
-
Robert Shiller data on housing, stock market, and more from his book Irrational Exuberance.
-
SMD: Stanford Microarray Database, stores raw and normalized data from microarray experiments.
-
Jerry Smith dataset collection, with Finance, Government, Machine Learning, Science, and other data.
-
SourceForge.net Research Data, includes historic and status statistics on approximately 100,000 projects and over 1 million registered users’ activities at the project management web site.
-
StatLib, CMU Datasets Archive.
-
STATOO Datasets part 1 and STATOO Datasets part 2
-
Time Series Data Library
-
Visual Analytics Benchmark Repository.
-
UCI KDD Database Repository for large datasets used in machine learning and knowledge discovery research.
-
UCI Machine Learning Repository.
-
UCR Time Series Data Archive, offering datasets, papers, links, and code.
-
UK Open Postcode Geo, UK/British postcodes with easting, northing, latitude, and longitude.
-
United States Census Bureau.
-
Web Data Commons, structured data from the Common Crawl, the largest public web corpus.
-
Webhose free datasets
-
Wikiposit, a (virtual) amalgamation of (mostly financial) data from many different sites, allowing users to merge data from different sources
-
Wolfram Alpha disease and patient level data.
-
Yahoo Sandbox datasets, Language, Graph, Ratings, Advertising and Marketing, Competition
-
Yelp Academic Dataset, all the data and reviews of the 250 closest businesses for 30 universities for students and academics to explore and research.
Data Journals
Data-artikelen | Sargasso
Data journalism and data visualization from the Datablog | News | The Guardian
Data Marketplaces and Data Hubs
Knoema – Home
Public Data Sets : Amazon Web Services
Socrata
Data Publica | Les données pour votre business
Archive-It – Web Archiving Services for Libraries and Archives
Freebase
Google Public Data Explorer
Welcome – the Data Hub
Data Sets | AggData
Find & Purchase Data Subscriptions | Windows Azure Marketplace
Factual | Home
Data Search Engines
Zanran Numerical Data Search
Quandl – Intelligent Search for Numerical Data
International Bodies & Agencies
IMF Data and Statistics
Data | The World Bank
OECD.Stat
UNdata
Data and maps — European Environment Agency (EEA)
Eurostat Home
Local Government
Inicio Misiones
Open Government Data Wien (OGD)
Open data – City of Brussels
Open Data – Brisbane City Council
Open data – Salford City Council
Sunderland City Council : Local Public Data
Welcome to the London Datastore | London DataStore
Leeds City Council – Open Data
Home – DataGM – Data Greater Manchester
Open Data | Derby City Council
Council data – Brighton & Hove City Council
Open Data – Birmingham City Council
Aberdeen City Council Open Data
Open Data – City of Waterloo
Open Data catalogue | City of Vancouver
Open Data Home – Open Data – Home | City of Toronto
City of Prince George – Open Data Catalogue
Open Data Ottawa | City of Ottawa
Open Data Catalogue – City of Red Deer
Open Data | City of Niagara Falls, Canada
Open Data Catalogue | City of Nanaimo
Mississauga.ca – Residents – Publications and Open Data Catalogue
City of Medicine Hat Open Data Catalogue
Kamloops open data
Open Data Catalogue Kelowna
City of Hamilton – Open Data
City of Fredericton – Open Data Home
City of Edmonton Open Data Catalogue
City of Somerville, MA
Data.Seattle.Gov | Seattle’s Data Site
City of Scottsdale
Welcome – Santa Cruz Open Data
Data | San Francisco
Open Raleigh – The Official City of Raleigh Portal
Datasets | CivicApps.org Portland OR
OpenDataPhilly – Connecting People With Data
NYC Open Data
Greater New Orleans Community Data Center
City of Madison | Open Data
City and County of Honolulu
US/Data Catalog District of Columbia
Denver Open Data Catalog
data.cookcountyil.gov | The Cook County Government Open Data Website
City of Chicago | Data Portal
Open Government | City of Boston
OpenBaltimore / City of Baltimore’s Open Data Catalog
Data.AustinTexas.gov | Open Austin
OpenDataAsheville – Connecting People With Data
US/Arvada
GovHK: About Data.One
data.gov.sg Singapore
Machine Learning Challenges
ACM KDD CUP
Competitions – Kaggle
Data – Repository – Causality Workbench
TunedIT – Data mining & machine learning data sets, algorithms, challenges
Machine Learning Datasets
TunedIT – Data mining & machine learning data sets, algorithms, challenges
mldata :: Welcome
UCI Machine Learning Repository: Data Sets
Miscellaneous Data Sources
IHME | Institute for Health Metrics and Evaluation
Gapminder: Unveiling the beauty of statistics for a fact based world view.
Doing Research in New York City Public Schools and Requesting Data – NYC Data – New York City Department of Education
RITA | BTS
Oregon Climate Data
Quantnet :: Start
Data Tools – Locators
My Data | Measured Me
Webscope from Yahoo! Labs
SourceForge.net Research Data
Online Data – Robert Shiller
Obtaining Data From the NSSDC
Cancer Program Data Sets
The Cancer Imaging Archive (TCIA)
Million Song Dataset | scaling MIR research
Google Ngram Viewer
Data | GeoDa Center
Home – GEO DataSets – NCBI
The Financial Data Finder A – G
Frequent Itemset Mining Dataset Repository
Europeana Professional – Linked Open Data
Inforum – EconData
Summary of Data Sets by Application Area
Data Sets | Pew Research Center’s Internet & American Life Project
Cosm – Explore
Advanced NFL Stats: Play-by-Play Data
National Governments and States
Portal de Obligaciones de Transparencia
Junta de Andalucía – Datos abiertos
Reutilización de la Información del Sector Público | Reutilización de la Información de los Servicios Públicos
Portal de Datos Abiertos de JCCM
Ayuntamiento de Zaragoza. Datos de Zaragoza Reutilización
Dades obertes Lleida – Ajuntament de Lleida
ISTAC | El ISTAC
Dades Obertes. Generalitat de Catalunya
Dades Obertes CAIB
Reutilización de la Información del Sector Público en Gijón
Open Data Euskadi ataria, Eusko Jaurlaritzaren datu publikoen irekitzea
Data for Hawaii | data.hawaii.gov
Florida Has A Right To Know
Open.Georgia.gov
Commonwealth Data Point
Open Data | data.maryland.gov
Connecticut Transparency Website
RI.gov: Open Data
NYS Data Center
Maine.gov DataShare
State of Alabama – Open.alabama.gov
Open Government for the State of Tennessee
Ohio.gov | Government | State Facts and History
OpenDoor – Kentucky
Data.Illinois.gov | Open Illinois
SOM – Michigan Data Store
Louisiana Transparency and Accountability Portal
data.mo.gov | State of Missouri Data Portal
DATAshare | data.iowa.gov
Minnesota open data // your portal for Minnesota data transparency
Open Data Texas
Welcome to Oklahoma’s Official Web Site
KanView: Kansas Transparency Taxpayer Act – Kansas Revenues and Expenditures Search
OPEN SD :: South Dakota Government Information
North Dakota GIS (Geographic Information Systems)
State Government Data New Mexico
Colorado.gov: The Official State Web Portal
Arizona OpenBooks | – Arizona Transparency Finances in Detail
Utah Data – Utah.gov
Data.CA.gov | Data Transparency for the State of California
Oregon Data | Opening Oregon’s Data
Data.Washington | Washington State’s Data Site
Home | Data.gov
Portal de Datos Públicos – Inicio
datos.gub.uy | Portal del Estado Uruguayo
Bem vindo – Portal Brasileiro de Dados Abertos
Directorio de Empresas, Marcas registradas, Normas legales y Teléfonos en Perú
StatCentral.ie – The Portal to Ireland’s Official Statistics
data.gov.be | The Belgian open data initiative
Data.overheid.nl: het open dataportaal van de Nederlandse overheid
PortalU – German Environmental Information Portal
Statistical database
Date.gov.md | Portalul datelor guvernamentale deschise al Republicii Moldova
Offene Daten Österreich | data.gv.at
Vitajte – data.gov.sk
dati.gov.it | I dati aperti della PA
Δημοσια, Ανοικτά Δεδομένα
Open Kenya | Transparent Africa
SAUDI | National e-Government Portal – Home
data.govt.nz – New Zealand government data online » Data.govt.nz
data.gov.au
국가공유자원포털
中国政府公开信息整合服务平台
Open Data Canada
OpenGovData.ru
OpenAid – Start
data.norge.no | Åpne offentlige data i Norge – Difi
Portada | datos.gob.es
Open Data Colombia
home | data.gov.uk
Open Companies Data Sources
Yelp’s Academic Dataset | Yelp
Data Export – Prosper
Lending Club Statistics – Lending Club
U.S. Agencies Data Sources
Federal Agency Participation | Data.gov
services.sunlightlabs.com
FRB: Data Download Program (DDP)
Various Lists of Data Sources
Programming Challenges: What are some good “toy problems” in data science? – Quora
Data: Where can I find large datasets open to the public? – Quora
Data Analysis: What’s your favorite free data source? – Quora
What are some publicly available market data feeds? – Quora
Is there a reliable free source for per country LinkedIn statistics? – Quora
@pskomoroch #dataset – Delicious
Free, Public Data Sets | Hacker News
List of European Open Data Catalogues at lod2.okfn.org
Open Data
Datasets Archive
Some Datasets Available on the Web » Data Wrangling Blog
Research Quality Datasets by Hilary Mason
Lending Club Loan Data
SMS Spam Collection
Flickr personal taxonomies
Yahoo Data for Researchers
ICWSM Spinnr Challenge 2011 dataset
Quantum Chaotic Thoughts: Facebook100 Data Set
Public Data Sets on Amazon Web Services (AWS)
The ClueWeb09 Dataset
Census Bureau Home Page
Data | The World Bank
ImageNet
What is Twitter, a Social Network or a News Media? – WWW’10
dotbot | DotNetDotCom.org
arXiv.org help – arXiv Bulk Data Access – Amazon S3
YouTube Dataset
Face Recognition Homepage – Databases
Pajek datasets
UCI Network Data Repository
Datasets for “The Elements of Statistical Learning”
Enron Email Dataset
MovieLens Data Sets | GroupLens Research
Translation Task – EMNLP 2011 Sixth Workshop on Statistical Machine Translation
Project Gutenberg
About WordNet – WordNet – About WordNet
Aligned Hansards of the 36th Parliament of Canada
CRCNS – Collaborative Research in Computational Neuroscience – Data sharing
USENET corpus
UniGene
ChEMBLdb
UCI Machine Learning Repository
Gene Expression Omnibus (GEO) Main page
Social Science Data
IMDB dataset
Stanford Large Network Dataset Collection
Google Books n-gram dataset
Million Song Dataset | scaling MIR research
Belly Button Biodiversity 2.0
Sharing PyPi/Maven dependency data « RTFB
Click Dataset | Center for Complex Networks and Systems Research
The Electric Rice Cooker — One year of deleted weibos archive
Registered meteorites that has impacted on Earth visualized – AnalyticBridge
GeoJSON files for real-time Virginia transportation data.
NYPD Crash Data Band-Aid
11 Billion Clues in 800 Million Documents: A Web Research Corpus Annotated with Freebase Concepts | Research Blog
Big data set – 3.5 billion web pages – made available for all of us – Big Data News
Data.Seattle.Gov | Seattle’s Data Site
New Crawl Data Available! | CommonCrawl
Detailed data on pass rates, race, and gender for 2013
Data Download
Sentinel-2
earth observation, satellite imagery, gis, natural resource, sustainability, disaster response
The Sentinel-2 mission is a land monitoring constellation of two satellites that provide high resolution optical imagery and continuity for the current SPOT and Landsat missions. The mission provides global coverage of the Earth’s land surface every 5 days, making the data of great use in ongoing studies. L1C data are available globally from June 2015. L2A data are available over the wider Europe region from April 2017 and globally since December 2018.
Usage examples
-
Sentinel Hub WMS/WMTS/WCS Service by Sinergise
-
Integrate imagery from the Sentinel-2 archive into your own apps, maps, and analysis with the Sentinel-2 image service by Esri
-
Python package for working with Sentinel-2 AWS data by Sinergise
-
Spectator – tracking Sentinel 2, accessing the data and quick preview by Spectator
-
EOS Land Viewer by Earth Observing System
Landsat 8
earth observation, satellite imagery, gis, natural resource, sustainability, disaster response
An ongoing collection of satellite imagery of all land on Earth produced by the Landsat 8 satellite.
Usage examples
-
Apps for exploring and analyzing Landsat imagery on the fly by Esri
-
Spectator – tracking Landsat 8, accessing the data and quick preview by Spectator
-
COG-Explorer – View Cloud Optimized GeoTIFF images in the browser directly from object storage by EOX
-
Sentinel Playground for Landsat by Sinergise
-
Sentinel Hub WMS/WMTS/WCS Service for Landsat by Sinergise
Common Crawl
encyclopedic, machine learning, natural language processing, internet
A corpus of web crawl data composed of over 25 billion web pages.
Usage examples
-
Web Data Commons – RDFa, microdata, and microformat data sets by Christian Bizer, Robert Meusel, Anna Primpeli
-
N-gram counts and language models from the Common Crawl by Christian Buck, Kenneth Heafield, Bas van Ooyen
-
Large-scale graph mining with Spark by Win Suen
-
Using open data to predict market movements by DELL EMC
-
Of using Common Crawl to play Family Feud by Paul Masurel
IRS 990 Filings
regulatory, statistics
Machine-readable data from certain electronic 990 forms filed with the IRS from 2011 to present.
Usage examples
-
Guide to Open Data for Nonprofit Research by lecy
-
Open990 by 990 Consulting, LLC
-
Grantmakers.io by Chad Kruse
-
aws-irs-990-explorer by Chris Herbert
-
Tutorial on using the IRS 990 e-file dataset by 990 Consulting, LLC
Terrain Tiles
elevation, earth observation, gis, sustainability, disaster response
A global dataset providing bare-earth terrain heights, tiled for easy usage and provided on S3.
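For orientation, here is a minimal sketch of how a longitude/latitude pair maps to a tile index. It assumes the tiles follow the common Web Mercator z/x/y ("slippy map") scheme; the listing above only says the data is tiled, so both the scheme and the function name lonlat_to_tile are assumptions to check against the dataset documentation.

```python
import math


def lonlat_to_tile(lon, lat, zoom):
    """Return the (x, y) tile index containing a point.

    Assumes the standard Web Mercator z/x/y tiling scheme (an assumption,
    not something stated in the dataset description).
    """
    n = 2 ** zoom  # number of tiles along each axis at this zoom level
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so that points on the map edge still fall inside the grid.
    x = min(max(x, 0), n - 1)
    y = min(max(y, 0), n - 1)
    return x, y


# Example: the tile containing the equator/prime-meridian point at zoom 1.
origin_tile = lonlat_to_tile(0.0, 0.0, 1)
```

The clamping step is a design choice for this sketch: it keeps inputs at the extreme latitudes or at longitude 180 inside the valid index range instead of raising an error.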
Usage examples
-
Sentinel Playground for DEM by Sinergise
-
R package for accessing Terrain Tiles by Jeffrey W. Hollister
-
EOS Land Viewer by Earth Observing System
-
PODPAC: Python Library supporting TerrainTiles for analysis by Creare
-
Sentinel Hub WMS/WMTS/WCS Service for DEM by Sinergise
CBERS on AWS
earth observation, gis, imaging, satellite imagery, sustainability, disaster response
This project creates an S3 repository with imagery acquired by the China-Brazil Earth Resources Satellite (CBERS). The image files are recorded and processed by Instituto Nacional de Pesquisas Espaciais (INPE) and are converted to Cloud Optimized GeoTIFF format to optimize their use in cloud-based applications. The repository contains all CBERS-4 MUX, AWFI, PAN5M and PAN10M scenes acquired since the start of the satellite mission and is updated daily with new scenes.
Usage examples
-
aws-sat-api-py by Remote Pixel
-
CBERS timelapse GIF generator by Frederico Liporace
-
cbers-tiler by Mapbox
-
Keeping a SpatioTemporal Asset Catalog (STAC) Up To Date with SNS/SQS by Frederico Liporace
-
EOS Land Viewer by Earth Observing System
SpaceNet on AWS
gis, computer vision, machine learning, earth observation, disaster response
A corpus of commercial satellite imagery and labeled training data to foster innovation in the development of computer vision algorithms.
Usage examples
-
Introducing the SpaceNet Road Detection and Routing Challenge and Dataset by David Lindenbaum
-
SpaceNet: Winning Implementations and New Imagery Release by Todd Stavish
-
Getting Started with SpaceNet Data by Adam Van Etten
-
Building Extraction with YOLT2 and SpaceNet Data by Adam Van Etten
-
2nd SpaceNet Competition Winners Code Release by David Lindenbaum
NEXRAD on AWS
earth observation, natural resource, weather, meteorological, sustainability
Real-time and archival data from the Next Generation Weather Radar (NEXRAD) network.
Usage examples
-
WeatherPipe – Amazon EMR based analysis tool for NEXRAD data stored on Amazon S3 by Stephen Lien Harrell
-
nexradaws on pypi.python.org – python module to query and download Nexrad data from Amazon S3 by Aaron Anderson
-
NEXRAD on EC2 tutorial by openradar
-
Mapping Noaa Nexrad Radar Data With CARTO by Stuart Lynn
-
Using Python to Access NCEI Archived NEXRAD Level 2 Data (Jupyter notebook) by Ryan May
Global Database of Events, Language and Tone (GDELT)
events, disaster response
This project monitors the world’s broadcast, print, and web news from nearly every corner of every country in over 100 languages and identifies the people, locations, organizations, counts, themes, sources, emotions, quotes, images and events driving our global society every second of every day.
Usage examples
-
Bootstrapping GeoMesa HBase on AWS S3 by Commonwealth Computer Research, Inc.
-
Exploring GDELT with Athena by Julien Simon
-
Creating PySpark DataFrame from CSV in AWS S3 in EMR by Jake Chen
-
Running R on Amazon Athena by Gopal Wunnava
OpenStreetMap on AWS
mapping, osm, sustainability, disaster response
OSM is a free, editable map of the world, created and maintained by volunteers. Regular OSM data archives are made available in Amazon S3.
Usage examples
-
Develop and Extract Value from Open Data by Daniel Bernao
-
OSM+Athena (GitHub) by Development Seed
-
Querying OpenStreetMap with Amazon Athena by Seth Fitzsimmons
-
PlanetUtils (GitHub): Scripts and a Docker container to maintain your own OpenStreetMap planet by Interline Technologies
Sentinel-1
earth observation, satellite imagery, gis, sustainability, disaster response
Sentinel-1 is a pair of European radar imaging (SAR) satellites launched in 2014 and 2016. Its 6-day revisit cycle and ability to observe through clouds make it well suited to sea and land monitoring, emergency response to environmental disasters, and economic applications. GRD data are available globally since January 2017.
Usage examples
-
Sentinel Playground by Sinergise
-
Sentinel Hub WMS/WMTS/WCS Service by Sinergise
-
EOS Land Viewer by Earth Observing System
-
EO Browser by Sinergise
Deutsche Börse Public Dataset
market data, financial markets, trading
The Deutsche Börse Public Data Set consists of trade data aggregated to one-minute intervals from the Eurex and Xetra trading systems. It provides the initial price, lowest price, highest price, final price and volume for every minute of the trading day, and for every tradeable security. If you need higher resolution data, including untraded price movements, please refer to our historical market data product. Also, be sure to check out our developer’s portal.
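To make the one-minute aggregation concrete, the sketch below builds the same five fields (initial, lowest, highest and final price, plus volume) from a list of individual trades. It illustrates the idea only and is not the provider's actual pipeline; the function name to_minute_bars, the dictionary keys and the sample trades are all hypothetical.

```python
from datetime import datetime


def to_minute_bars(trades):
    """Aggregate (timestamp, price, volume) trades into one-minute bars.

    Returns a dict keyed by the start of each minute. The key names are
    illustrative and do not reflect the dataset's real column names.
    """
    bars = {}
    for ts, price, volume in sorted(trades, key=lambda t: t[0]):
        minute = ts.replace(second=0, microsecond=0)
        bar = bars.get(minute)
        if bar is None:
            # First trade of the minute sets every price field.
            bars[minute] = {"start": price, "min": price, "max": price,
                            "end": price, "volume": volume}
        else:
            bar["min"] = min(bar["min"], price)
            bar["max"] = max(bar["max"], price)
            bar["end"] = price  # trades are time-ordered, so the last one wins
            bar["volume"] += volume
    return bars


# Made-up trades for illustration only.
trades = [
    (datetime(2018, 1, 2, 9, 0, 5), 10.0, 100),
    (datetime(2018, 1, 2, 9, 0, 40), 10.5, 50),
    (datetime(2018, 1, 2, 9, 0, 59), 9.8, 25),
    (datetime(2018, 1, 2, 9, 1, 10), 9.9, 10),
]
bars = to_minute_bars(trades)
```

Minutes without any trade produce no bar in this sketch, which is one reason a higher resolution product is needed to see untraded price movements.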
Usage examples
-
10 visualizations to try in Amazon QuickSight with sample data by AWS Big Data Blog
-
Stock Price Movement Prediction Using The Deutsche Börse Public Dataset & Machine Learning by Originate
-
Streaming XETRA Data Using Apache Spark by Thermobook
GEOS-Chem Input Data
climate, weather, meteorological, environmental, air quality, sustainability
Input data for the GEOS-Chem Chemical Transport Model, including the NASA/GMAO MERRA-2 and GEOS-FP meteorological products, the HEMCO emission inventories, and other small data such as model initial conditions.
Usage examples
-
Tutorial on accessing GEOS-Chem data bucket in S3 by Jiawei Zhuang
-
Running GEOS-Chem on Cloud Computing Platforms, presented at the 8th International GEOS-Chem Meeting by Jiawei Zhuang, et al.
-
Overview of the GEOSChem-on-cloud project by Atmospheric Chemistry Modeling Group, Harvard University
MODIS on AWS
gis, satellite imagery, natural resource, sustainability, disaster response
Select products from the Moderate Resolution Imaging Spectroradiometer (MODIS) managed by the U.S. Geological Survey and NASA.
Usage examples
-
Sentinel Hub WMS/WMTS/WCS Service for MODIS by Sinergise
-
EOS Land Viewer by Earth Observing System
-
Sentinel Playground for MODIS by Sinergise
SILO climate data on AWS
climate, earth observation, environmental, meteorological, model, sustainability, water, weather
SILO is a database of Australian climate data from 1889 to the present. It provides continuous, daily time-step data products in ready-to-use formats for research and operational applications. Gridded SILO data in annual NetCDF format are on AWS. Point data are available from the SILO website.
Usage examples
-
Python script to calculate a regional mean by SILO
-
Convert NetCDF to ESRI ArcASCII or GeoTIFF by SILO
-
NetCDF Operators to calculate seasonal means by SILO
Africa Soil Information Service (AfSIS) Soil Chemistry
agriculture, environmental, food security, machine learning, life sciences, sustainability
This dataset contains paired wet and dry chemistry measurements for georeferenced soil samples that were collected through the Africa Soil Information Service (AfSIS) project, which lasted from 2009 through 2018. In this release, we include data collected during Phase I (2009-2013). Georeferenced samples were collected from many Sub-Saharan African countries, and their soil properties were analyzed using both wet and dry chemistry. The two types of data can be paired to form a training dataset for machine learning, such that certain soil properties can be well-predicted through less expensive dry chemistry techniques.
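The pairing step described above amounts to joining the two measurement sets on a shared sample identifier. Here is a minimal sketch under stated assumptions: the field names sample_id, spectrum and value, the helper pair_measurements and the sample records are hypothetical and do not reflect the actual AfSIS file layout.

```python
def pair_measurements(dry_records, wet_records):
    """Join dry- and wet-chemistry records on sample ID.

    Returns (features, target) pairs: the dry-chemistry reading is the
    model input and the wet-chemistry value is the label. Samples that
    lack either measurement are dropped.
    """
    wet_by_id = {rec["sample_id"]: rec for rec in wet_records}
    pairs = []
    for rec in dry_records:
        match = wet_by_id.get(rec["sample_id"])
        if match is not None:
            pairs.append((rec["spectrum"], match["value"]))
    return pairs


# Made-up records for illustration only.
dry = [
    {"sample_id": "s1", "spectrum": [0.12, 0.30, 0.45]},
    {"sample_id": "s2", "spectrum": [0.10, 0.28, 0.50]},
    {"sample_id": "s3", "spectrum": [0.15, 0.33, 0.41]},
]
wet = [
    {"sample_id": "s1", "value": 6.4},
    {"sample_id": "s3", "value": 5.9},
]
training_pairs = pair_measurements(dry, wet)
```

An inner join is used here so that every training example has both an input and a label; sample s2, which has no wet-chemistry value, is left out.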
Usage examples
-
Goalkeepers 2018, Soil – The Big Data Beneath Your Feet by QED
-
AfSIS Soil Chemistry – Usage Tutorial by QED
Amazon Bin Image Dataset
computer vision, machine learning
The Amazon Bin Image Dataset contains over 500,000 images and metadata from bins of a pod in an operating Amazon Fulfillment Center. The bin images in this dataset are captured as robot units carry pods as part of normal Amazon Fulfillment Center operations.
Usage examples
-
Amazon Inventory Reconciliation using AI by Pablo Rodriguez Bertorello, Sravan Sripada, Nutchapol Dendumrongsup
-
Amazon Bin Image Dataset Challenge by silverbottlep
Amazon Customer Reviews Dataset
natural language processing, information retrieval, machine learning
Amazon Customer Reviews (a.k.a. Product Reviews) is one of Amazon’s iconic products. In a period of over two decades since the first review in 1995, millions of Amazon customers have contributed over a hundred million reviews to express opinions and describe their experiences regarding products on the Amazon.com website. More than 130 million customer reviews are available to researchers as part of this dataset.
Usage examples
-
How to scale sentiment analysis using Amazon Comprehend, AWS Glue and Amazon Athena by Roy Hasson
-
Implementing a recommender system with Amazon SageMaker and Apache MXNet Gluon by David Arpin
ECMWF ERA5 Reanalysis
climate, earth observation, meteorological, sustainability, weather
ERA5 is the fifth generation of ECMWF atmospheric reanalyses of the global climate, and the first reanalysis produced as an operational service. It utilizes the best available observation data from satellites and in-situ stations, which are assimilated and processed using ECMWF’s Integrated Forecast System (IFS) Cycle 41r2. The dataset provides all essential atmospheric meteorological parameters, including but not limited to air temperature, pressure and wind at different altitudes, along with surface parameters such as rainfall and soil moisture content, and sea parameters such as sea-surface temperature and wave height. ERA5 provides data at a considerably higher spatial and temporal resolution than its legacy counterpart ERA-Interim. ERA5 consists of a high resolution version with 31 km horizontal resolution, and a reduced resolution ensemble version with 10 members. Data are currently available from 2008 onward and will be extended backwards, first to 1979 and then to 1950. Learn more about ERA5 in Jon Olauson’s paper “ERA5: The new champion of wind power modelling?”
Usage examples
-
Accessing ERA5 Data on S3 Using Boto by Intertrust Technologies Corporation
-
ERA5 tutorial using the Planet OS API by Intertrust Technologies Corporation
Hubble Space Telescope Public Data
astronomy
The Hubble Space Telescope (HST) is one of the most productive scientific instruments ever created. This dataset contains calibrated and raw data for all of the currently active instruments on HST: ACS, COS, STIS and WFC3.
Usage examples
-
Exploring AWS Lambda with cloud-hosted Hubble public data by Arfon Smith
-
Making HST Public Data Available on AWS by Arfon Smith
NAIP on AWS
earth observation, aerial imagery, gis, natural resource, regulatory, sustainability
The National Agriculture Imagery Program (NAIP) acquires aerial imagery during the agricultural growing seasons in the continental U.S. This “leaf-on” imagery typically ranges from 60 centimeters to 100 centimeters in resolution and is available from the naip-analytic Amazon S3 bucket as 4-band (RGB + NIR) imagery in MRF format, from the naip-source Amazon S3 bucket as 4-band (RGB + NIR) imagery in uncompressed raw GeoTIFF format, and from the naip-visualization bucket as 3-band (RGB) imagery in Cloud Optimized GeoTIFF format. NAIP data is delivered at the state level; every year, a number of states receive updates, with an overall update cycle of two or three years.
NASA NEX
earth observation, natural resource, climate, sustainability
A collection of Earth science datasets maintained by NASA, including climate change projections and satellite images of the Earth’s surface.
Usage examples
-
Azavea Climate API by Azavea
-
Accessing and plotting NASA-NEX data, from GEOSChem-on-cloud tutorial. by Jiawei Zhuang
New York City Taxi and Limousine Commission (TLC) Trip Record Data
cities, urban, transportation
Data of trips taken by taxis and for-hire vehicles in New York City.
Usage examples
-
Deep Dive on Flink & Spark on Amazon EMR by Keith Steward
-
Build a Real-time Stream Processing Pipeline with Apache Flink on AWS by Steffen Hausmann
Open City Model (OCM)
events
Open City Model is an initiative to provide cityGML data for all the buildings in the United States. By using other open datasets in conjunction with our own code and algorithms it is our goal to provide 3D geometries for every US building.
Details →
Usage examples
-
Using Open City Model with the 3dCityDB by Allen Gilliland
-
Running queries on Open City Model using AWS Athena by Allen Gilliland
OpenAQ
air quality, cities, environmental, gis, sustainability
Global, aggregated physical air quality data from public data sources provided by government, research-grade and other sources. These awesome groups do the hard work of measuring these data and publicly sharing them, and our community makes them more universally accessible to both humans and machines.
Details →
Usage examples
-
hackAIR by hackAir
-
ropenaq R package by Maëlle Salmon
-
Smokey: Air Quality Bot by Amrit Sharma
-
ARISense by Aerodyne Research, Inc.
-
Access OpenAQ data via a filterable SNS topic by OpenAQ
QIIME 2 User Tutorial Datasets
bioinformatics, biology, denoising, ecosystems, environmental, genetic, genomic, health, machine learning, microbiome, statistics
QIIME 2 is a powerful, extensible, and decentralized microbiome analysis package with a focus on data and analysis transparency. QIIME 2 enables researchers to start an analysis with raw DNA sequence data and finish with publication-quality figures and statistical results. This dataset contains the user docs (and related datasets) for QIIME 2.
Details →
Usage examples
-
Installing QIIME 2 using Amazon Web Services by The QIIME 2 Development Team
-
QIIME 2 User Documentation by The QIIME 2 Development Team
Rapid7 FDNS ANY Dataset
computer security, cyber security, analytics, internet
A subset of FDNS ANY queries against domain names produced by Rapid7 Project Sonar, made available in Amazon S3. More information on the schema can be found at Rapid7’s Open Data website.
Details →
Usage examples
-
How to Conduct DNS Reconnaissance for $.02 Using Rapid7 Open Data and AWS by Shan Sikdar at Rapid7
-
Creating a Project Sonar FDNS API with AWS by Evan Perotti at SecurityRiskAdvisors
USGS 3DEP LiDAR Point Clouds
elevation, gis, lidar
The goal of the USGS 3D Elevation Program (3DEP) is to collect elevation data in the form of light detection and ranging (LiDAR) data over the conterminous United States, Hawaii, and the U.S. territories, with data acquired over an 8-year period. This dataset provides two realizations of the 3DEP point cloud data. The first resource is a public access organization provided in Entwine Point Tiles format, which is a lossless, streamable octree based on LASzip (LAZ) encoding. The second resource is a Requester Pays bucket of full-density raw LAZ data. Resource names in both buckets correspond to the USGS project names.
Details →
Usage examples
-
WebGL Visualization of USGS 3DEP Lidar Point Clouds with Potree and Plasio.js by Connor Manning
-
Using Lambda Layers with USGS 3DEP LiDAR Point Clouds by Howard Butler
1000 Genomes
genetic, genomic, life sciences
The 1000 Genomes Project is an international collaboration which has established the most detailed catalogue of human genetic variation, including SNPs, structural variants, and their haplotype context. The final phase of the project sequenced more than 2500 individuals from 26 different populations around the world and produced an integrated set of phased haplotypes with more than 80 million variants for these individuals.
Details →
Usage examples
-
Exploratory data analysis of genomic datasets using ADAM and Mango with Apache Spark on Amazon EMR by Alyssa Marrow
AWS iGenomes
biology, genetic, genomic, life sciences
Common reference genomes hosted on AWS S3. Can be used when aligning and analysing raw DNA sequencing data.
Details →
Allen Brain Observatory – Visual Coding AWS Public Data Set
neurobiology, neuro imaging, image processing, machine learning, life sciences
The Allen Brain Observatory – Visual Coding is the first standardized in vivo survey of physiological activity in the mouse visual cortex, featuring representations of visually evoked calcium responses from GCaMP6-expressing neurons in selected cortical layers, visual areas, and Cre lines.
Details →
Cornell EAS Data Lake
agriculture, climate, earth observation, elevation, environmental, gis, mapping, meteorological, sustainability, weather
Earth & Atmospheric Sciences at Cornell University has created a public data lake of climate data. The data is stored in columnar storage formats (ORC) to make it straightforward to query using standard tools like Amazon Athena or Apache Spark. The data itself was originally intended for building decision support tools for farmers and digital agriculture. The first dataset is the historical NDFD / NDGD data distributed by NCEP / NOAA / NWS. The NDFD (National Digital Forecast Database) and NDGD (National Digital Guidance Database) contain gridded forecasts and observations at 2.5km resolution for the Contiguous United States (CONUS). There are also 5km grids for several smaller US regions and non-contiguous territories, such as Hawaii, Guam, Puerto Rico and Alaska. NOAA distributes archives of the NDFD/NDGD via its NOAA Operational Model Archive and Distribution System (NOMADS) in Grib2 format. The data has been converted to ORC to optimize storage space and, more importantly, to simplify data access via standard data analytics tools.
Details →
GOES on AWS
gis, weather, earth observation, meteorological, sustainability, disaster response
GOES satellites (GOES-16 & GOES-17) provide continuous weather imagery and monitoring of meteorological and space environment data across North America. GOES satellites provide the kind of continuous monitoring necessary for intensive data analysis. They hover continuously over one position on the surface. The satellites orbit high enough to allow for a full-disc view of the Earth. Because they stay above a fixed spot on the surface, they provide a constant vigil for the atmospheric “triggers” for severe weather conditions such as tornadoes, flash floods, hailstorms, and hurricanes. When these conditions develop, the GOES satellites are able to monitor storm development and track their movements.
Details →
Usage examples
-
Billions of Birds Migrate. Where Do They Go? by National Geographic
NOAA Global Forecast System (GFS) Model
climate, weather, environmental, sustainability, disaster response
The Global Forecast System (GFS) is a weather forecast model produced by the National Centers for Environmental Prediction (NCEP). Dozens of atmospheric and land-soil variables are available through this dataset, from temperatures, winds, and precipitation to soil moisture and atmospheric ozone concentration. The entire globe is covered by the GFS at a base horizontal resolution of 18 miles (28 kilometers) between grid points, which is used by the operational forecasters who predict weather out to 16 days in the future. Horizontal resolution drops to 44 miles (70 kilometers) between grid points for forecasts between one week and two weeks.
Details →
Safecast
air quality, climate, environmental, gis, radiation
An ongoing collection of radiation and air quality measurements taken by devices involved in the Safecast project.
Details →
TCGA on AWS
cancer, genomic, life sciences
The Cancer Genome Atlas (TCGA) is a joint effort of the National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI) to accelerate our understanding of the molecular basis of cancer. TCGA-funded researchers across the United States have produced a corpus of raw and processed genomic, transcriptomic, and epigenomic data from thousands of cancer patients.
Details →
Usage examples
-
Building High-Throughput Genomics Batch Workflows on AWS by Aaron Friedman
Transiting Exoplanet Survey Satellite (TESS)
astronomy
The Transiting Exoplanet Survey Satellite (TESS) is a two-year survey that will discover exoplanets in orbit around bright stars. More information about TESS is available at MAST and the TESS Science Support Center.
Details →
U.S. Census ACS PUMS
statistics, census, survey, sustainability
U.S. Census Bureau American Community Survey (ACS) Public Use Microdata Sample (PUMS) available in a linked data format using the Resource Description Framework (RDF) data model.
Details →
Voices Obscured in Complex Environmental Settings (VOiCES)
machine learning, automatic speech recognition, speaker identification, denoising, speech processing
VOiCES is a speech corpus recorded in acoustically challenging settings, using distant microphone recording. Speech was recorded in real rooms with various acoustic features (reverb, echo, HVAC systems, outside noise, etc.). Adversarial noise, either television, music, or babble, was concurrently played with clean speech. Data was recorded using multiple microphones strategically placed throughout the room. The corpus includes audio recordings, orthographic transcriptions, and speaker labels.
Details →
Usage examples
-
Getting started with VOiCES data by M.A. Barrios
Xiph.Org Test Media
computer vision, image processing, imaging, machine learning, media, movies, multimedia
Uncompressed video used for video compression and video processing research.
Details →
3000 Rice Genomes Project
agriculture, food security, genetic, genomic, life sciences
The 3000 Rice Genomes Project is an international effort to sequence the genomes of 3,024 rice varieties from 89 countries.
Details →
A Realistic Cyber Defense Dataset (CSE-CIC-IDS2018)
network traffic, internet, intrusion detection, cyber security
This dataset is the result of a collaborative project between the Communications Security Establishment (CSE) and the Canadian Institute for Cybersecurity (CIC) that uses the notion of profiles to generate cybersecurity datasets in a systematic manner. It includes a detailed description of intrusions along with abstract distribution models for applications, protocols, or lower-level network entities. The dataset includes seven different attack scenarios, namely Brute-force, Heartbleed, Botnet, DoS, DDoS, Web attacks, and infiltration of the network from inside. The attacking infrastructure includes 50 machines, and the victim organization has 5 departments and includes 420 PCs and 30 servers. This dataset includes the network traffic and log files of each machine from the victim side, along with 80 network traffic features extracted from captured traffic using CICFlowMeter-V3. For more information on the creation of this dataset, see this paper by researchers at the Canadian Institute for Cybersecurity (CIC) and the University of New Brunswick (UNB): Toward Generating a New Intrusion Detection Dataset and Intrusion Traffic Characterization.
Details →
Broad Genome References
biology, bioinformatics, cancer, genetic, genomic
Broad-maintained human genome reference builds hg19/hg38 and decoy references.
Details →
CCAFS-Climate Data
agriculture, food security, climate, sustainability
High resolution climate data to help assess the impacts of climate change primarily on agriculture. These open access datasets of climate projections will help researchers make climate change impact assessments.
Details →
COCO – Common Objects in Context – fast.ai datasets
deep learning, computer vision, machine learning
COCO is a large-scale object detection, segmentation, and captioning dataset. This is part of the fast.ai datasets collection hosted by AWS for convenience of fast.ai students. If you use this dataset in your research please cite arXiv:1405.0312 [cs.CV].
Details →
DWD COSMO-D2
climate, disaster response, earth observation, environmental, machine learning, meteorological, model, weather
COSMO-D2 high-resolution, short-range numerical weather prediction model for Germany and adjacent countries; regular grid with 2.2km resolution and 65 vertical levels; updated at 00UTC and every following 3h; forecast range 27h (45h for 03UTC); selection of commonly used parameters
Details →
DWD COSMO-D2 EPS Ensemble
climate, disaster response, earth observation, environmental, machine learning, meteorological, model, weather
COSMO-D2 EPS high-resolution, short-range numerical weather ensemble prediction model for Germany and adjacent countries; 20 ensemble members, regular grid with 2.2km resolution and 65 vertical levels; updated at 00UTC and every following 3h; forecast range 27h (45h for 03UTC); selection of commonly used parameters; ensemble members are bundled in joint grib files
Details →
DWD ICON Global
climate, disaster response, earth observation, environmental, machine learning, meteorological, model, weather
ICON global numerical weather prediction model; average resolution of 13km with 90 vertical levels; updated at 00UTC and every following 6h with a forecast range of 120h (180h for 00UTC and 12UTC); selection of commonly used parameters
Details →
DWD ICON Global EPS Ensemble
climate, disaster response, earth observation, environmental, machine learning, meteorological, model, weather
ICON global EPS ensemble prediction model; 40 ensemble members; average resolution of 40km; updated at 00UTC and every following 6h with a forecast range of 120h (extended to 180h for 00UTC and 12UTC); selection of commonly used parameters; ensemble members are bundled in joint grib files
Details →
DWD ICON-EU
climate, disaster response, earth observation, environmental, machine learning, meteorological, model, weather
ICON-EU regional numerical weather prediction model; European nesting region with increased resolution of approximately 6.5km with 60 vertical levels; updated at 00UTC and every following 3h with 120h forecast range; selection of commonly used parameters
Details →
DWD ICON-EU EPS Ensemble
climate, disaster response, earth observation, environmental, machine learning, meteorological, model, weather
ICON-EU EPS regional ensemble weather prediction model; 40 ensemble members; European nesting region with increased resolution of approximately 20km; updated at 00UTC and every following 3h with 120h forecast range; selection of commonly used parameters; ensemble members are bundled in joint grib files
Details →
District of Columbia – Classified Point Cloud LiDAR
gis, cities, us-dc, disaster response
LiDAR point cloud data for Washington, DC is available for anyone to use on Amazon S3. This dataset, managed by the Office of the Chief Technology Officer (OCTO), through the direction of the District of Columbia GIS program, contains tiled point cloud data for the entire District along with associated metadata.
Details →
Downscaled Climate Data for Alaska
climate, coastal, earth observation, environmental, weather
This dataset contains historical and projected dynamically downscaled climate data for the State of Alaska and surrounding regions at 20km spatial resolution and hourly temporal resolution. This data was produced using the Weather Research and Forecasting (WRF) model (Version 3.5). We downscaled ERA-Interim historical reanalysis data (1979-2015) as well as historical and projected runs from two GCMs from the Coupled Model Intercomparison Project 5 (CMIP5): GFDL-CM3 and NCAR-CCSM4 (historical run: 1970-2005 and RCP 8.5: 2006-2100).
Details →
EPA Risk-Screening Environmental Indicators
environmental, sustainability
Detailed air model results from EPA’s Risk-Screening Environmental Indicators (RSEI) model.
Details →
Genome in a Bottle on AWS
genomic, life sciences
Several reference genomes to enable translation of whole human genome sequencing to clinical practice.
Details →
Global Surface Summary of Day
environmental, climate, weather, natural resource, regulatory, sustainability
GSOD is a collection of daily weather measurements (temperature, wind speed, humidity, pressure, and more) from 9000+ weather stations around the world.
Details →
Google Books Ngrams
natural language processing
N-grams are fixed-size tuples of items. In this case the items are words extracted from the Google Books corpus. The n specifies the number of elements in the tuple, so a 5-gram contains five words or characters. The n-grams in this dataset were produced by passing a sliding window over the text of books and outputting a record for each new token.
Details →
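The sliding-window construction described for Google Books Ngrams can be illustrated with a minimal Python sketch. The function name `ngrams`, the whitespace tokenization and the sample sentence are assumptions made for illustration; the dataset's actual tokenization and record layout are not described here.

```python
from collections import Counter


def ngrams(tokens, n):
    """Yield every n-gram (as a tuple) by sliding a window of width n over tokens."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])


# Hypothetical sample text; whitespace splitting stands in for real tokenization.
text = "the quick brown fox jumps over the quick brown dog"
tokens = text.split()

# Count how often each 2-gram occurs in the sample.
bigram_counts = Counter(ngrams(tokens, 2))
print(bigram_counts[("the", "quick")])

# A window of width 5 over 10 tokens yields one record per window position.
print(len(list(ngrams(tokens, 5))))
```

Each window position produces one tuple, so a text of k tokens yields k - n + 1 n-grams (and none when k is smaller than n).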
HIRLAM Weather Model
earth observation, climate, weather, meteorological, sustainability
HIRLAM (High Resolution Limited Area Model) is an operational synoptic and mesoscale weather prediction model managed by the Finnish Meteorological Institute.
Details →
ICGC on AWS
cancer, genomic, life sciences
The International Cancer Genome Consortium (ICGC) coordinates projects with the common aim of accelerating research into the causes and control of cancer. The PanCancer Analysis of Whole Genomes (PCAWG) study is an international collaboration to identify common patterns of mutation in whole genomes from ICGC. More than 2,400 consistently analyzed genomes corresponding to over 1,100 unique ICGC donors are now freely available on Amazon S3 to credentialed researchers subject to ICGC data sharing policies.
Details →
Image classification – fast.ai datasets
deep learning, computer vision, machine learning
Some of the most important datasets for image classification research, including CIFAR 10 and 100, Caltech 101, MNIST, Food-101, Oxford-102-Flowers, Oxford-IIIT-Pets, and Stanford-Cars. This is part of the fast.ai datasets collection hosted by AWS for convenience of fast.ai students. See documentation link for citation and license details for each dataset.
Details →
Image localization – fast.ai datasets
deep learning, computer vision, machine learning
Some of the most important datasets for image localization research, including Camvid and PASCAL VOC (2007 and 2012). This is part of the fast.ai datasets collection hosted by AWS for convenience of fast.ai students. See documentation link for citation and license details for each dataset.
Details →
KITTI Vision Benchmark Suite
autonomous vehicles, computer vision, robotics, machine learning, deep learning
Dataset and benchmarks for computer vision research in the context of autonomous driving. The dataset has been recorded in and around the city of Karlsruhe, Germany using the mobile platform AnnieWay (VW station wagon) which has been equipped with several RGB and monochrome cameras, a Velodyne HDL 64 laser scanner as well as an accurate RTK corrected GPS/IMU localization unit. The dataset has been created for computer vision and machine learning research on stereo, optical flow, visual odometry, semantic segmentation, semantic instance segmentation, road segmentation, single image depth prediction, depth map completion, 2D and 3D object detection and object tracking. In addition, several raw data recordings are provided. The datasets are captured by driving around the mid-size city of Karlsruhe, in rural areas and on highways. Up to 15 cars and 30 pedestrians are visible per image.
Details →
Multimedia Commons
computer vision, machine learning, multimedia
The Multimedia Commons is a collection of audio and visual features computed for the nearly 100 million Creative Commons-licensed Flickr images and videos in the YFCC100M dataset from Yahoo! Labs, along with ground-truth annotations for selected subsets. The International Computer Science Institute (ICSI) and Lawrence Livermore National Laboratory are producing and distributing a core set of derived feature sets and annotations as part of an effort to enable large-scale video search capabilities. They have released this feature corpus into the public domain, under Creative Commons License 0, so it is free for anyone to use for any purpose.
Details →
NLP – fast.ai datasets
deep learning, natural language processing, machine learning
Some of the most important datasets for NLP, with a focus on classification, including IMDb, AG-News, Amazon Reviews (polarity and full), Yelp Reviews (polarity and full), Dbpedia, Sogou News (Pinyin), Yahoo Answers, Wikitext 2 and Wikitext 103, and ACL-2010 French-English 10^9 corpus. This is part of the fast.ai datasets collection hosted by AWS for convenience of fast.ai students. See documentation link for citation and license details for each dataset.
Details →
NOAA Global Ensemble Forecast System (GEFS)
climate, meteorological, sustainability, weather
The Global Ensemble Forecast System (GEFS), previously known as the GFS Global ENSemble (GENS), is a weather forecast model made up of 21 separate forecasts, or ensemble members. The National Centers for Environmental Prediction (NCEP) started the GEFS to address the nature of uncertainty in weather observations, which are used to initialize weather forecast models. The GEFS attempts to quantify the amount of uncertainty in a forecast by generating an ensemble of multiple forecasts, each minutely different, or perturbed, from the original observations. With global coverage, GEFS is produced four times a day with weather forecasts going out to 16 days.
Details →
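The ensemble idea in the GEFS description (many forecasts, each started from slightly perturbed observations, with the spread of results indicating uncertainty) can be sketched with a toy model. Everything below is an illustrative assumption: `toy_forecast` uses a chaotic logistic map as a stand-in for a weather model, the perturbation size is arbitrary, and all 21 members are perturbed; none of this reflects how NCEP actually generates GEFS members.

```python
import random
import statistics

N_MEMBERS = 21  # ensemble size quoted in the GEFS description


def toy_forecast(x0, steps=20, r=3.9):
    """Hypothetical stand-in for a forecast model: iterate the chaotic logistic map."""
    x = x0
    for _ in range(steps):
        x = r * x * (1.0 - x)
    return x


def run_ensemble(observed, perturbation=1e-4, seed=0):
    """Run one forecast per member, each started from a slightly perturbed observation."""
    rng = random.Random(seed)
    starts = [observed + rng.uniform(-perturbation, perturbation)
              for _ in range(N_MEMBERS)]
    return [toy_forecast(x0) for x0 in starts]


members = run_ensemble(observed=0.5)
mean = statistics.mean(members)
spread = statistics.stdev(members)  # a larger spread suggests a less certain forecast
print(f"members={len(members)} mean={mean:.3f} spread={spread:.3f}")
```

The standard deviation across members is only one possible measure of spread; it is used here because it needs nothing beyond the standard library.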
NOAA Global Historical Climatology Network Daily (GHCN-D)
climate, meteorological, sustainability, weather
Global Historical Climatology Network – Daily is a dataset from NOAA that contains daily observations over global land areas. It contains station-based measurements from land-based stations worldwide, about two thirds of which are for precipitation measurement only. Other meteorological elements include, but are not limited to, daily maximum and minimum temperature, temperature at the time of observation, snowfall and snow depth. It is a composite of climate records from numerous sources that were merged together and subjected to a common suite of quality assurance reviews. Some data are more than 175 years old. The data is in CSV format. Each file corresponds to a year from 1763 to present and is named as such.
Details →
NOAA High-Resolution Rapid Refresh (HRRR) Model
climate, weather, environmental, sustainability, disaster response
The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.
Details →
NOAA National Water Model Reanalysis
weather, climate, environmental, disaster response, agriculture, transportation, sustainability
The NOAA National Water Model Reanalysis dataset contains output from a 25-year retrospective simulation (January 1993 through December 2017) of version 1.2 of the National Water Model. This simulation used observed rainfall as input and ingested other required meteorological input fields from a weather Reanalysis dataset. The output frequency and fields available in this historical NWM dataset differ from those contained in the real-time forecast model. One application of this dataset is to provide historical context to current real-time streamflow, soil moisture and snowpack NWM conditions. The Reanalysis data can be used to infer flow frequencies and perform temporal analyses with hourly streamflow output and 3-hourly land surface output. The long-term dataset can also be used in the development of end user applications which require a long baseline of data for system training or verification purposes.
Details →
NOAA National Water Model Short-Range Forecast
weather, climate, environmental, disaster response, agriculture, transportation, sustainability
The National Water Model (NWM) is a water resources model that simulates and forecasts water budget variables, including snowpack, evapotranspiration, soil moisture and streamflow, over the entire continental United States (CONUS). The model, launched in August 2016, is designed to improve the ability of NOAA to meet the needs of its stakeholders (forecasters, emergency managers, reservoir operators, first responders, recreationists, farmers, barge operators, and ecosystem and floodplain managers) by providing expanded accuracy, detail, and frequency of water information. It is operated by NOAA’s Office of Water Prediction. This bucket contains a four-week rollover of the Short Range Forecast model output and the corresponding forcing data for the model. The model is forced with meteorological data from the High Resolution Rapid Refresh (HRRR) and the Rapid Refresh (RAP) models. The Short Range Forecast configuration cycles hourly and produces hourly deterministic forecasts of streamflow and hydrologic states out to 18 hours.
Details →
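The Short Range Forecast cadence described above (a new cycle every hour, each producing hourly forecasts out to 18 hours) can be made concrete with a small date-arithmetic sketch. The function name `valid_times`, the example cycle time, and the assumption that lead hours run from 1 to 18 are illustrative only and are not taken from the National Water Model documentation.

```python
from datetime import datetime, timedelta, timezone

FORECAST_HOURS = 18  # short-range horizon quoted in the description


def valid_times(cycle_start):
    """Return the hourly valid times for one forecast cycle (assumed lead hours 1..18)."""
    return [cycle_start + timedelta(hours=h) for h in range(1, FORECAST_HOURS + 1)]


# Hypothetical cycle start time, chosen only for illustration.
cycle = datetime(2019, 6, 1, 12, tzinfo=timezone.utc)
times = valid_times(cycle)
print(times[0].isoformat(), "->", times[-1].isoformat())
```

Because the configuration cycles hourly, consecutive cycles overlap heavily: under these assumptions a given valid time is covered by up to 18 different cycles.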
NOAA Operational Forecast System (OFS)
climate, coastal, disaster response, environmental, meteorological, sustainability, oceans, water, weather
The Operational Forecast System (OFS) has been developed to serve the maritime user community. OFS was developed in a joint project of the NOAA/National Ocean Service (NOS)/Office of Coast Survey, the NOAA/NOS/Center for Operational Oceanographic Products and Services (CO-OPS), and the NOAA/National Weather Service (NWS)/National Centers for Environmental Prediction (NCEP) Central Operations (NCO). OFS generates water level, water current, water temperature, water salinity (except for the Great Lakes) and wind conditions nowcast and forecast guidance four times per day.
Details →
Nanopore Reference Human Genome
genomic, life sciences
This dataset includes the sequencing and assembly of a reference standard human genome (GM12878) using the MinION nanopore sequencing instrument with the R9.4 1D chemistry.
Details →
Open Observatory of Network Interference
internet
A free software, global observation network for detecting censorship, surveillance and traffic manipulation on the internet.
Details →
OpenNeuro
biology, imaging, neurobiology, neuro imaging
OpenNeuro is a database of openly-available brain imaging data. The data are shared according to a Creative Commons CC0 license, providing a broad range of brain imaging data to researchers and citizen scientists alike. The database primarily focuses on functional magnetic resonance imaging (fMRI) data, but also covers other imaging modalities, including structural and diffusion MRI, electroencephalography (EEG), and magnetoencephalography (MEG). OpenfMRI is a project of the Center for Reproducible Neuroscience at Stanford University. Development of the OpenfMRI resource has been funded by the National Science Foundation, National Institute on Drug Abuse, and the Laura and John Arnold Foundation.
Details →
OpenStreetMap Linear Referencing
gis, traffic, osm, sustainability, disaster response
OSMLR is a linear referencing system built on top of OpenStreetMap. OSM has great information about roads around the world and their interconnections, but it lacks the means to give a stable identifier to a stretch of roadway. OSMLR provides a stable set of numerical IDs for every 1 kilometer stretch of roadway around the world. In urban areas, OSMLR IDs are attached to each block of roadways between significant intersections.
Details →
Physionet
biology, life sciences
PhysioNet offers free web access to large collections of recorded physiologic signals (PhysioBank) and related open-source software (PhysioToolkit).
Details →
Tabula Muris
biology, encyclopedic, genomic, health, life sciences, machine learning, medicine
Tabula Muris is a compendium of single cell transcriptomic data from the model organism Mus musculus comprising more than 100,000 cells from 20 organs and tissues. These data represent a new resource for cell biology, reveal gene expression in poorly characterized cell populations, and allow for direct and controlled comparison of gene expression in cell types shared between tissues, such as T-lymphocytes and endothelial cells from different anatomical locations. Two distinct technical approaches were used for most organs: one approach, microfluidic droplet-based 3’-end counting, enabled the survey of thousands of cells at relatively low coverage, while the other, FACS-based full length transcript analysis, enabled characterization of cell types with high sensitivity and coverage. The cumulative data provide the foundation for an atlas of transcriptomic cell biology. See: https://www.nature.com/articles/s41586-018-0590-4
Details →
The Genome Modeling System
genetic, genomic, life sciences
The Genome Institute at Washington University has developed a high-throughput, fault-tolerant analysis information management system called the Genome Modeling System (GMS), capable of executing complex, interdependent, and automated genome analysis pipelines at a massive scale. The GMS framework provides detailed tracking of samples and data coupled with reliable and repeatable analysis pipelines. GMS includes a full system image with software and services, expandable from one workstation to a large compute cluster.
Details →
The Human Connectome Project
neuro imaging, life sciences
The Human Connectome Project aims to provide an unparalleled compilation of neural data, an interface to graphically navigate this data and the opportunity to achieve never before realized conclusions about the living human brain.
Details →
The Human Microbiome Project
life sciences
The NIH-funded Human Microbiome Project (HMP) is a collaborative effort of over 300 scientists from more than 80 organizations to comprehensively characterize the microbial communities inhabiting the human body and elucidate their role in human health and disease. To accomplish this task, microbial community samples were isolated from a cohort of 300 healthy adult human subjects at 18 specific sites within five regions of the body (oral cavity, airways, urogenital tract, skin, and gut). Targeted sequencing of the 16S bacterial marker gene and/or whole metagenome shotgun sequencing was performed for thousands of these samples. In addition, whole genome sequences were generated for isolate strains collected from human body sites to act as reference organisms for analysis. Finally, 16S marker and whole metagenome sequencing was also done on additional samples from people suffering from several disease conditions.
Details →
The Massively Multilingual Image Dataset (MMID)
computer vision, machine learning, machine translation, natural language processing
MMID is a large-scale, massively multilingual dataset of images paired with the words they represent, collected at the University of Pennsylvania. The dataset is doubly parallel: for each language, words are stored parallel to images that represent the word, and parallel to the word’s translation into English (and corresponding images).
Details →
UK Met Office Atmospheric Deterministic and Probabilistic Forecasts
earth observation, climate, weather, meteorological, sustainability
Meteorological data reusers now have an exciting opportunity to sample, experiment with and evaluate Met Office atmospheric model data, whilst also experiencing a transformative method of requesting data via RESTful APIs on AWS, all ahead of the Met Office’s own operationally supported API platform that will be launched in late 2019. For information about the data see the Met Office website. For examples of using the data check out the examples repository. If you need help and support using the data please raise an issue on the examples repository.
Details →
Unidata GOES-16
gis, weather, earth observation, meteorological, sustainability, disaster response
GOES provides continuous weather imagery and monitoring of meteorological and space environment data across North America.
Details →
American Ninja Warrior Obstacle History
multimedia, events, sports
Obstacle history of American Ninja Warrior seasons 1-9. This dataset includes every obstacle in the history of American Ninja Warrior from season 1 to 9. This includes the obstacles at Sasuke (also known as the original Ninja Warrior in Japan) during seasons 1-3, when American Ninja Warrior (ANW) was on G4 and the top 10 competitors from the semi-finals round of ANW were sent to Sasuke to compete. Starting in season 4 of ANW, known as the “NBC era”, the show took on the regional/city formats for both qualifying and semi-final rounds, with the finalists from each region competing at the National Finals of ANW in Las Vegas.
Cell Painting Image Collection
microscopy, biology, life sciences, high-throughput imaging, cell imaging, cell painting, fluorescence imaging
The Cell Painting Image Collection is a collection of freely downloadable microscopy image sets. Cell Painting is an unbiased high-throughput imaging assay used to analyze perturbations in cell models. In addition to the images themselves, each set includes a description of the biological application and some type of “ground truth” (expected results). Researchers are encouraged to use these image sets as reference points when developing, testing, and publishing new image analysis algorithms for the life sciences. We hope that this data set will lead to a better understanding of which methods are best for various biological image analysis applications.
Collection of daily coin data from Coin Metrics
financial markets, economics, bitcoin, blockchain
This project is set to pull the latest daily coin data from Coin Metrics using the data.world sync applet on IFTTT. Daily on-chain transaction volume is calculated as the sum of all transaction outputs belonging to the blocks mined on the given day. “Change” outputs are not included. The transaction count figure doesn’t include coinbase transactions.
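The volume and transaction-count definitions given for the Coin Metrics data can be sketched in Python as follows. This is an illustrative sketch, not Coin Metrics' implementation: the `Output` and `Tx` record types, the `is_change` and `is_coinbase` flags, and the sample values are all hypothetical. Identifying change outputs is a heuristic in practice and is simply assumed to be given here. Reading the description literally, coinbase outputs still count toward volume; that reading is also an assumption.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Output:
    value: float             # amount carried by this output
    is_change: bool = False  # hypothetical flag: output returns funds to the sender


@dataclass
class Tx:
    outputs: List[Output] = field(default_factory=list)
    is_coinbase: bool = False  # hypothetical flag: block-reward transaction


def daily_volume(txs: List[Tx]) -> float:
    """Sum every non-change output of the transactions mined on one day."""
    return sum(o.value for tx in txs for o in tx.outputs if not o.is_change)


def daily_tx_count(txs: List[Tx]) -> int:
    """Count the day's transactions, leaving out coinbase transactions."""
    return sum(1 for tx in txs if not tx.is_coinbase)


# Made-up transactions for a single day.
day = [
    Tx([Output(12.5)], is_coinbase=True),
    Tx([Output(2.0), Output(0.5, is_change=True)]),
    Tx([Output(1.0)]),
]
print(daily_volume(day))    # 15.5
print(daily_tx_count(day))  # 2
```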
Federal Government Awards
census, government spending, regulatory, us
The Federal Awards dataset contains a complete export of the data available from USASpending. This dataset reflects all observations submitted through the third quarter of fiscal year 2017.
Medicare Drug Spending
pharmaceutical, statistics, us
Finding ways to make Medicare drug spending data more consumable.
Usage examples
-
Datasets for and from the drug-spending channel in the Data for Democracy community, by Data for Democracy (@data4democracy) on data.world.
NFA 2017 – Ecological Resource Use and Resource Capacity of Nations from 1961 to 2013
environmental, climate, economics, life sciences, sustainability
Our National Footprint Accounts (NFAs) measure the ecological resource use and resource capacity of nations from 1961 to 2013. The calculations in the National Footprint Accounts are primarily based on United Nations data sets, including those published by the Food and Agriculture Organization, United Nations Commodity Trade Statistics Database, and the UN Statistics Division, as well as the International Energy Agency.
Swiss Public Transport Stops
cities, gis, infrastructure, mapping, traffic, transportation
The basic geo-data set for public transport stops comprises public transport stops in Switzerland and additional selected geo-referenced public transport locations that are of operational or structural importance (operating points).
Translated Sacred Text Word Counts
natural language processing, machine learning
Counts of words used in English-language translations of sacred texts, with a flag for common words.
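A table of word counts with a common-word flag, like the one described for the sacred-text dataset, could be produced along these lines. This is a minimal sketch under stated assumptions: the `COMMON_WORDS` set, the tokenizing regex, and the sample sentence are hypothetical, and the dataset's own definition of a "common" word is not given in the description.

```python
import re
from collections import Counter
from typing import List, Tuple

# Hypothetical stop-word list; the dataset's actual criterion is unknown.
COMMON_WORDS = {"the", "and", "of", "in", "to", "a"}


def word_counts(text: str) -> List[Tuple[str, int, bool]]:
    """Return (word, count, is_common) rows, most frequent words first."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    return [(w, n, w in COMMON_WORDS) for w, n in counts.most_common()]


rows = word_counts("In the beginning was the Word, and the Word was with God")
for word, count, is_common in rows:
    print(word, count, is_common)
```

Carrying the flag on each row lets an analysis keep or drop common words without recomputing the counts.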
DigitalGlobe Open Data Program
earth observation, disaster response, gis, satellite imagery, sustainability
Pre- and post-event high-resolution satellite imagery in support of emergency planning, risk assessment, monitoring of staging areas and emergency response, damage assessment, and recovery. Also includes crowdsourced damage assessments for major, sudden-onset disasters.
GATK Test Data
biology, bioinformatics, cancer, genetic, genomic
The GATK test data resource bundle is a collection of files for resequencing human genomic data with the Broad Institute’s Genome Analysis Toolkit (GATK).
Cross-disciplinary data repositories, data collections and data search engines:
-
http://datasource.kapsarc.org
-
https://www.kaggle.com/datasets
-
http://www.assetmacro.com
-
http://usgovxml.com
-
http://aws.amazon.com/datasets
-
http://databib.org
-
http://datacite.org
-
http://figshare.com
-
http://linkeddata.org
-
http://reddit.com/r/datasets
-
http://thewebminer.com/
-
http://thedatahub.org alias http://ckan.net
-
http://quandl.com
-
Social Network Analysis Interactive Dataset Library (Social Network Datasets)
-
Datasets for Data Mining
-
Enigma Public
-
http://www.ufindthem.com/
-
http://NetworkRepository.com – The First Interactive Network Data Repository
-
http://MLvis.com
-
Open Data Inception – A Comprehensive List of 2500+ Open Data Portals in the World
-
http://data.opendatasoft.com OpenDataSoft catalog
Single datasets and data repositories:
-
http://archive.ics.uci.edu/ml/
-
http://crawdad.org/
-
http://data.austintexas.gov
-
http://data.cityofchicago.org
-
http://data.govloop.com
-
http://data.gov.uk/
-
data.gov.in
-
http://data.medicare.gov
-
http://data.seattle.gov
-
http://data.sfgov.org
-
http://data.sunlightlabs.com
-
https://datamarket.azure.com/
-
http://developer.yahoo.com/geo/g…
-
http://econ.worldbank.org/datasets
-
http://en.wikipedia.org/wiki/Wik…
-
http://factfinder.census.gov/ser…
-
http://ftp.ncbi.nih.gov/
-
http://gettingpastgo.socrata.com
-
http://googleresearch.blogspot.c…
-
http://books.google.com/ngrams/
-
http://medihal.archives-ouvertes.fr
-
http://public.resource.org/
-
http://rechercheisidore.fr
-
http://snap.stanford.edu/data/in…
-
http://timetric.com/public-data/
-
https://wist.echo.nasa.gov/~wist…
-
http://www2.jpl.nasa.gov/srtm
-
http://www.archives.gov/research…
-
http://www.bls.gov/
-
http://www.crunchbase.com/
-
http://www.dartmouthatlas.org/
-
http://www.data.gov/
-
http://www.datakc.org
-
http://dbpedia.org
-
http://www.delicious.com/jbaldwi…
-
http://www.faa.gov/data_research/
-
http://www.factual.com/
-
http://research.stlouisfed.org/f…
-
http://www.freebase.com/
-
http://www.google.com/publicdata…
-
http://www.guardian.co.uk/news/d…
-
http://www.infochimps.com
-
http://www.kaggle.com/
-
http://build.kiva.org/
-
http://www.nationalarchives.gov….
-
http://www.nyc.gov/html/datamine…
-
http://www.ordnancesurvey.co.uk/…
-
http://www.philwhln.com/how-to-g…
-
http://www.imdb.com/interfaces
-
http://imat-relpred.yandex.ru/en…
-
http://www.dados.gov.pt/pt/catal…
-
http://knoema.com
-
http://daten.berlin.de/
-
http://www.qunb.com
-
http://data.reegle.info/
-
http://data.wien.gv.at/
-
http://data.gov.bc.ca
-
https://pslcdatashop.web.cmu.edu/ (interaction data in learning environments)
-
http://www.icpsr.umich.edu/icpsrweb/CPES/ – Collaborative Psychiatric Epidemiology Surveys: (A collection of three national surveys focused on each of the major ethnic groups to study psychiatric illnesses and health services use)
-
http://www.dati.gov.it
-
http://dati.trentino.it
-
http://www.databagg.com/
-
http://networkrepository.com – Network/ML data repository w/ visual interactive analytics
-
Home (United Nations Environment Programme Grid)