Atmospheric Models from Météo-France
Tags: climate, disaster response, earth observation, environmental, machine learning, meteorological, model, sustainability, weather
Global and high-resolution regional atmospheric models from Météo-France. ARPEGE World covers the entire world at a base horizontal resolution of 0.5° (~55 km) between grid points and predicts weather up to 114 hours into the future.

ARPEGE Europe covers Europe and North Africa at a base horizontal resolution of 0.1° (~11 km) between grid points and predicts weather up to 114 hours into the future.

AROME France covers France at a base horizontal resolution of 0.025° (~2.5 km) between grid points and predicts weather up to 42 hours into the future.

AROME France HD covers France and neighboring areas at a base horizontal resolution of 0.01° (~1.5 km) between grid points and predicts weather up to 42 hours into the future. Dozens of atmospheric variables are avail...
Usage examples:
Windguru.cz by Windguru
Windy.com by Windy

RarePlanes
Tags: computer vision, deep learning, earth observation, geospatial, labeled, machine learning, satellite imagery
RarePlanes is a unique open-source machine learning dataset from CosmiQ Works and AI.Reverie that incorporates both real and synthetically generated satellite imagery. The RarePlanes dataset specifically focuses on the value of AI.Reverie synthetic data to aid computer vision algorithms in their ability to automatically detect aircraft and their attributes in satellite imagery. Although other synthetic/real combination datasets exist, RarePlanes is the largest openly-available very high resolution dataset built to test the value of synthetic data from an overhead perspective. The real portion ...
Usage examples:
RarePlanes: Synthetic Data Takes Flight by Jacob Shermeyer, Thomas Hossler, Adam Van Etten, Daniel Hogan, Ryan Lewis, Daeil Kim
RarePlanes Codebase by Thomas Hossler and Jacob Shermeyer

Terra Fusion Data Sampler
Tags: geospatial, satellite imagery, sustainability
The Terra Basic Fusion dataset is a fused dataset of the original Level 1 radiances from the five Terra instruments. They have been fully validated to contain the original Terra instrument Level 1 data. Each Level 1 Terra Basic Fusion file contains one full Terra orbit of data and is typically 15 – 40 GB in size, depending on how much data was collected for that orbit. It contains instrument radiance in physical units; radiance quality indicator; geolocation for each IFOV at its native resolution; sun-view geometry; observation time; and other attributes/metadata. It is stored in HDF5, conforms to CF conventions, and is accessible by netCDF-4 enhanced models. Its naming convention follows: TERRA_BF_L1B_OXXXX_YYYYMMDDHHMMSS_F000_V000.h5. A concise description of the dataset, along with links to complete documentation and available software tools, can be found on the Terra Fusion project page: https://terrafusion.web.illinois.edu.

Terra is the flagship satellite of NASA’s Earth Observing System (EOS). It was launched into orbit on December 18, 1999 and carries five instruments. These are the Moderate-resolution Imaging Spectroradiometer (MODIS), the Multi-angle Imaging SpectroRadiometer (MISR), the Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER), the Clouds and Earth’s Radiant Energy System (CERES), and the Measurements of Pollution in the Troposphere (MOPITT).

The Terra Basic Fusion dataset is an easy-to-access record of the Level 1 radiances for instruments on...
Usage examples:
TerraFusion GitHub by University of Illinois
Basic Terra fusion product algorithm theoretical basis and data specifications by Zhao, Guangu; Yang, Muqun; Clipp, Landon; Gao, Yizhao; Lee, Joe H.
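
The naming convention quoted in the Terra Fusion description can be unpacked with a small parser. The following is a minimal sketch, not an official tool: it assumes the orbit field after `O` is all digits, reads the 14-digit field as a timestamp, and keeps the `F` and `V` fields as raw strings because the listing does not say what they mean. The function name and the sample file name are hypothetical.

```python
import re
from datetime import datetime

# Template from the listing: TERRA_BF_L1B_OXXXX_YYYYMMDDHHMMSS_F000_V000.h5
# Assumption: the orbit is all digits; the F/V fields are three digits each.
_NAME = re.compile(
    r"^TERRA_BF_L1B_O(?P<orbit>\d+)_(?P<stamp>\d{14})"
    r"_F(?P<f_field>\d{3})_V(?P<v_field>\d{3})\.h5$"
)


def parse_terra_bf_name(name):
    """Split a Terra Basic Fusion file name into its labeled fields."""
    match = _NAME.match(name)
    if match is None:
        raise ValueError(f"not a Terra Basic Fusion file name: {name!r}")
    return {
        "orbit": int(match.group("orbit")),
        "timestamp": datetime.strptime(match.group("stamp"), "%Y%m%d%H%M%S"),
        "f_field": match.group("f_field"),  # meaning not stated in the listing
        "v_field": match.group("v_field"),  # meaning not stated in the listing
    }


# Hypothetical file name, made up for illustration only.
info = parse_terra_bf_name("TERRA_BF_L1B_O1234_20000401120000_F000_V000.h5")
print(info["orbit"], info["timestamp"].isoformat())
```

Rejecting names that do not match keeps unrelated objects from being treated as orbit files.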

Ford Multi-AV Seasonal Dataset
Tags: autonomous vehicles, computer vision, lidar, mapping, robotics, transportation, urban, weather
This research presents a challenging multi-agent seasonal dataset collected by a fleet of Ford autonomous vehicles on different days and at different times during 2017-18. The vehicles were manually driven on an average route of 66 km in Michigan that included a mix of driving scenarios such as the Detroit Airport, freeways, city centres, a university campus, and suburban neighbourhoods. Each vehicle used in this data collection is a Ford Fusion outfitted with an Applanix POS-LV inertial measurement unit (IMU), four HDL-32E Velodyne 3D-lidar scanners, 6 Point Grey 1.3 MP Cameras arranged on the...
Usage example: Ford AV Dataset Tutorial by Ford Motor Company

Human Cancer Models Initiative (HCMI) Cancer Model Development Center
Tags: cancer, genomic, life sciences, STRIDES, whole genome sequencing
The Human Cancer Models Initiative (HCMI) is an international consortium that is generating novel, next-generation, tumor-derived culture models annotated with genomic and clinical data. HCMI-developed models and related data are available as a community resource. The NCI is contributing to the initiative by supporting four Cancer Model Development Centers (CMDCs). CMDCs are tasked with producing next-generation cancer models from clinical samples. The cancer models include tumor types that are rare, originate from patients from underrepresented populations, lack precision therapy, or lack ca...
Usage example: Genomic Data Commons by National Cancer Institute

NOAA National Digital Forecast Database (NDFD)
Tags: climate, meteorological, sustainability, weather
The National Digital Forecast Database (NDFD) is a suite of gridded forecasts of sensible weather elements (e.g., cloud cover, maximum temperature). Forecasts prepared by NWS field offices working in collaboration with the National Centers for Environmental Prediction (NCEP) are combined in the NDFD to create a seamless mosaic of digital forecasts from which operational NWS products are generated.
Usage example: NDFD Product Spreadsheet (excel file) by NOAA MDL

NOAA National Water Model Reanalysis
Tags: agriculture, climate, disaster response, environmental, sustainability, transportation, weather
The NOAA National Water Model Reanalysis dataset contains output from multi-decade retrospective simulations. These simulations used observed rainfall as input and ingested other required meteorological input fields from a weather reanalysis dataset. The output frequency and fields available in this historical NWM dataset differ from those contained in the real-time forecast model. One application of this dataset is to provide historical context to current real-time streamflow, soil moisture and snowpack NWM conditions. The reanalysis data can be used to infer flow frequencies and perform temp...
Usage example: Simulating storm surge and compound flooding events with a creek-to-ocean model: Importance of baroclinic effects by Fei Ye, et al.

NOAA Operational Forecast System (OFS)
Tags: climate, coastal, disaster response, environmental, meteorological, oceans, sustainability, water, weather
The Operational Forecast System (OFS) has been developed to serve the maritime user community. OFS was developed in a joint project of the NOAA/National Ocean Service (NOS)/Office of Coast Survey, the NOAA/NOS/Center for Operational Oceanographic Products and Services (CO-OPS), and the NOAA/National Weather Service (NWS)/National Centers for Environmental Prediction (NCEP) Central Operations (NCO). OFS generates water level, water current, water temperature, water salinity (except for the Great Lakes) and wind conditions nowcast and forecast guidance four times per day.
Usage example: OFS Data Aggregation and Sub-Setting by NOAA

New Jersey Statewide Digital Aerial Imagery Catalog
Tags: aerial imagery, earth observation, geospatial, imaging, mapping
The New Jersey Office of GIS, NJ Office of Information Technology manages a series of 11 digital orthophotography and scanned aerial photo maps collected at various years ranging from 1930 to 2017. Each year’s worth of imagery is available as Cloud Optimized GeoTIFF (COG) files, and some years are also available as compressed MrSID and/or JP2 files. Additionally, each year of imagery is organized into a tile grid scheme covering the entire geography of New Jersey. Many years share the same tiling grid while others have unique grids as defined by the project at the time.
Usage example: Visualize Imagery Changes by stephanie.bosits@tech.nj.gov

New Jersey Statewide LiDAR
Tags: elevation, geospatial, lidar, mapping
Elevation datasets in New Jersey have been collected over several years as several discrete projects. Each project covers a geographic area, which is a subsection of the entire state, and has differing specifications based on the available technology at the time and project budget. The geographic extent of one project may overlap that of a neighboring project. Each of the 18 projects contains deliverable products such as LAS (Lidar point cloud) files, unclassified/classified, tiled to cover project area; relevant metadata records or documents, most adhering to the Federal Geographic Data Com...
Usage example: 3D Visualization by stephanie.bosits@tech.nj.us

SILAM Air Quality
Tags: air quality, climate, earth observation, meteorological, sustainability, weather
Air Quality is a global SILAM atmospheric composition and air quality forecast performed on a daily basis for more than 100 species and covering the troposphere and the stratosphere. The output produces 3D concentration fields and aerosol optical thickness. The data are unique: 20 km resolution for global AQ models is unseen worldwide.
Usage example: Simple examples by Roope Tervo

Tabula Muris
Tags: biology, encyclopedic, genomic, health, life sciences, machine learning, medicine
Tabula Muris is a compendium of single cell transcriptomic data from the model organism Mus musculus comprising more than 100,000 cells from 20 organs and tissues. These data represent a new resource for cell biology, reveal gene expression in poorly characterized cell populations, and allow for direct and controlled comparison of gene expression in cell types shared between tissues, such as T-lymphocytes and endothelial cells from different anatomical locations. Two distinct technical approaches were used for most organs: one approach, microfluidic droplet-based 3’-end counting, enabled the s...
Usage example: Single-cell transcriptomics of 20 mouse organs creates a Tabula Muris. by Tabula Muris Consortium (2019)

Voices Obscured in Complex Environmental Settings (VOiCES)
Tags: automatic speech recognition, denoising, machine learning, speaker identification, speech processing
VOiCES is a speech corpus recorded in acoustically challenging settings, using distant microphone recording. Speech was recorded in real rooms with various acoustic features (reverb, echo, HVAC systems, outside noise, etc.). Adversarial noise, either television, music, or babble, was concurrently played with clean speech. Data was recorded using multiple microphones strategically placed throughout the room. The corpus includes audio recordings, orthographic transcriptions, and speaker labels.
Usage example: Getting started with VOiCES data by M.A. Barrios

A Realistic Cyber Defense Dataset (CSE-CIC-IDS2018)
Tags: cyber security, internet, intrusion detection, network traffic
This dataset is the result of a collaborative project between the Communications Security Establishment (CSE) and The Canadian Institute for Cybersecurity (CIC) that uses the notion of profiles to generate cybersecurity datasets in a systematic manner. It includes a detailed description of intrusions along with abstract distribution models for applications, protocols, or lower level network entities. The dataset includes seven different attack scenarios, namely Brute-force, Heartbleed, Botnet, DoS, DDoS, Web attacks, and infiltration of the network from inside. The attacking infrastructure incl...

A2D2: Audi Autonomous Driving Dataset
Tags: autonomous vehicles, computer vision, deep learning, lidar, machine learning, mapping, robotics
An open multi-sensor dataset for autonomous driving research. This dataset comprises semantically segmented images, semantic point clouds, and 3D bounding boxes. In addition, it contains unlabelled 360 degree camera images, lidar, and bus data for three sequences. We hope this dataset will further facilitate active research and development in AI, computer vision, and robotics for autonomous driving.

CCAFS-Climate Data
Tags: agriculture, climate, food security, sustainability
High resolution climate data to help assess the impacts of climate change primarily on agriculture. These open access datasets of climate projections will help researchers make climate change impact assessments.

COCO - Common Objects in Context - fast.ai datasets
Tags: computer vision, deep learning, machine learning
COCO is a large-scale object detection, segmentation, and captioning dataset. This is part of the fast.ai datasets collection hosted by AWS for convenience of fast.ai students. If you use this dataset in your research please cite arXiv:1405.0312 [cs.CV].

Crowdsourced Bathymetry
Tags: earth observation, oceans, sustainability
Community-provided bathymetry data collected in collaboration with the International Hydrographic Organization.

District of Columbia - Classified Point Cloud LiDAR
Tags: cities, disaster response, geospatial, us-dc
LiDAR point cloud data for Washington, DC is available for anyone to use on Amazon S3. This dataset, managed by the Office of the Chief Technology Officer (OCTO), through the direction of the District of Columbia GIS program, contains tiled point cloud data for the entire District along with associated metadata.

Downscaled Climate Data for Alaska
Tags: climate, coastal, earth observation, environmental, sustainability, weather
This dataset contains historical and projected dynamically downscaled climate data for the State of Alaska and surrounding regions at 20 km spatial resolution and hourly temporal resolution. Select variables are also summarized at daily resolution. This data was produced using the Weather Research and Forecasting (WRF) model (Version 3.5). We downscaled ERA-Interim historical reanalysis data (1979-2015) as well as historical and projected runs from two GCMs from the Coupled Model Inter-comparison Project 5 (CMIP5): GFDL-CM3 and NCAR-CCSM4 (historical run: 1970-2005 and RCP 8.5: 2006-2100).

EPA Risk-Screening Environmental Indicators
Tags: environmental, sustainability
Detailed air model results from EPA’s Risk-Screening Environmental Indicators (RSEI) model.

Epoch of Reionization Dataset
Tags: astronomy
The data are from observations with the Murchison Widefield Array (MWA), which is a Square Kilometer Array (SKA) precursor in Western Australia. This particular dataset is from the Epoch of Reionization project, which is a key science driver of the SKA. Nearly 2 PB of such observations have been recorded to date; this is a small subset that has been exported from the MWA data archive in Perth and made available to the public on AWS. The data were taken to detect signatures of the first stars and galaxies forming and the effect of these early stars and galaxies on the evolution of the u...

Genome Ark
Tags: bioinformatics, biology, genetic, genomic, life sciences
The Genome Ark hosts genomic information for the Vertebrate Genomes Project (VGP) and other related projects. The VGP is an international collaboration that aims to generate complete and near error-free reference genomes for all extant vertebrate species. These genomes will be used to address fundamental questions in biology and disease, to identify species most genetically at risk for extinction, and to preserve genetic information of life.

Genome in a Bottle on AWS
Tags: genomic, life sciences
Several reference genomes to enable translation of whole human genome sequencing to clinical practice.

Global Surface Summary of Day
Tags: climate, environmental, natural resource, regulatory, sustainability, weather
GSOD is a collection of daily weather measurements (temperature, wind speed, humidity, pressure, and more) from 9000+ weather stations around the world.

Google Books Ngrams
Tags: natural language processing
N-grams are fixed size tuples of items. In this case the items are words extracted from the Google Books corpus. The n specifies the number of elements in the tuple, so a 5-gram contains five words or characters. The n-grams in this dataset were produced by passing a sliding window over the text of books and outputting a record for each new token.
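
The sliding-window idea can be illustrated in a few lines. This is a minimal sketch of word n-gram extraction over a whitespace-tokenized sentence; it is not the pipeline used to build the Google Books dataset, and the function name and sample text are made up for illustration.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple, in order."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # The window starts at each index that still leaves room for n tokens.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Illustrative sentence; real corpora need proper tokenization.
words = "the quick brown fox".split()
bigrams = ngrams(words, 2)
print(bigrams)
```

A text with fewer than n tokens yields an empty list, because no full window fits.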

HIRLAM Weather Model
Tags: climate, earth observation, meteorological, sustainability, weather
HIRLAM (High Resolution Limited Area Model) is an operational synoptic and mesoscale weather prediction model managed by the Finnish Meteorological Institute.

High Resolution Population Density Maps + Demographic Estimates by CIESIN and Facebook
Tags: aerial imagery, demographics, disaster response, geospatial, image processing, machine learning, population, satellite imagery, sustainability
Population data for a selection of countries, allocated to 1 arcsecond blocks and provided in a combination of CSV and Cloud-optimized GeoTIFF files. This refines CIESIN’s Gridded Population of the World using machine learning models on high-resolution worldwide Digital Globe satellite imagery. CIESIN population counts aggregated from worldwide census data are allocated to blocks where imagery appears to contain buildings.

Human PanGenomics Project
Tags: genomic, life sciences
This dataset includes sequencing data, assemblies, and analyses for the offspring of ten parent-offspring trios.

IChangeMyCity Complaints Data from Janaagraha
Tags: cities, civic, complaints, machine learning
The IChangeMyCity project provides insight into the complaints raised by citizens from different cities of India about issues in their neighbourhoods, and into the resolution of those complaints by the civic bodies.

IRS 990 Filings (Spreadsheets)
Tags: economics, regulatory, statistics, us
Excerpts of electronic Form 990 and 990-EZ filings, converted to spreadsheet form. Additional fields are being added regularly.

ISERV
Tags: earth observation, environmental, geospatial, satellite imagery
ISS SERVIR Environmental Research and Visualization System (ISERV) was a fully-automated prototype camera aboard the International Space Station that was tasked to capture high-resolution Earth imagery of specific locations at 3-7 frames per second. In the course of its regular operations during 2013 and 2014, ISERV's camera acquired images that can be used primarily in environmental and disaster management.

Image classification - fast.ai datasets
Tags: computer vision, deep learning, machine learning
Some of the most important datasets for image classification research, including CIFAR 10 and 100, Caltech 101, MNIST, Food-101, Oxford-102-Flowers, Oxford-IIIT-Pets, and Stanford-Cars. This is part of the fast.ai datasets collection hosted by AWS for convenience of fast.ai students. See documentation link for citation and license details for each dataset.

Image localization - fast.ai datasets
Tags: computer vision, deep learning, machine learning
Some of the most important datasets for image localization research, including Camvid and PASCAL VOC (2007 and 2012). This is part of the fast.ai datasets collection hosted by AWS for convenience of fast.ai students. See documentation link for citation and license details for each dataset.

KITTI Vision Benchmark Suite
Tags: autonomous vehicles, computer vision, deep learning, machine learning, robotics
Dataset and benchmarks for computer vision research in the context of autonomous driving. The dataset has been recorded in and around the city of Karlsruhe, Germany using the mobile platform AnnieWay (VW station wagon) which has been equipped with several RGB and monochrome cameras, a Velodyne HDL 64 laser scanner as well as an accurate RTK corrected GPS/IMU localization unit. The dataset has been created for computer vision and machine learning research on stereo, optical flow, visual odometry, semantic segmentation, semantic instance segmentation, road segmentation, single image depth predic...

Kepler Mission Data
Tags: astronomy
The Kepler mission observed the brightness of more than 180,000 stars near the Cygnus constellation at a 30 minute cadence for 4 years in order to find transiting exoplanets, study variable stars, and find eclipsing binaries. More information about the Kepler mission is available at MAST.

Multimedia Commons
Tags: computer vision, machine learning, multimedia
The Multimedia Commons is a collection of audio and visual features computed for the nearly 100 million Creative Commons-licensed Flickr images and videos in the YFCC100M dataset from Yahoo! Labs, along with ground-truth annotations for selected subsets. The International Computer Science Institute (ICSI) and Lawrence Livermore National Laboratory are producing and distributing a core set of derived feature sets and annotations as part of an effort to enable large-scale video search capabilities. They have released this feature corpus into the public domain, under Creative Commons License 0, s...

NLP - fast.ai datasets
Tags: deep learning, machine learning, natural language processing
Some of the most important datasets for NLP, with a focus on classification, including IMDb, AG-News, Amazon Reviews (polarity and full), Yelp Reviews (polarity and full), Dbpedia, Sogou News (Pinyin), Yahoo Answers, Wikitext 2 and Wikitext 103, and ACL-2010 French-English 10^9 corpus. This is part of the fast.ai datasets collection hosted by AWS for convenience of fast.ai students. See documentation link for citation and license details for each dataset.

NOAA Emergency Response Imagery
Tags: aerial imagery, climate, disaster response, sustainability, weather
In order to support NOAA's homeland security and emergency response requirements, the National Geodetic Survey Remote Sensing Division (NGS/RSD) has the capability to acquire and rapidly disseminate a variety of spatially-referenced datasets to federal, state, and local government agencies, as well as the general public. Remote sensing technologies used for these projects have included lidar, high-resolution digital cameras, a film-based RC-30 aerial camera system, and hyperspectral imagers. Examples of rapid response initiatives include acquiring high resolution images with the Emerge/App...

NOAA Global Ensemble Forecast System (GEFS)
Tags: climate, meteorological, sustainability, weather
The Global Ensemble Forecast System (GEFS), previously known as the GFS Global ENSemble (GENS), is a weather forecast model made up of 21 separate forecasts, or ensemble members. The National Centers for Environmental Prediction (NCEP) started the GEFS to address the nature of uncertainty in weather observations, which are used to initialize weather forecast models. The GEFS attempts to quantify the amount of uncertainty in a forecast by generating an ensemble of multiple forecasts, each minutely different, or perturbed, from the original observations. With global coverage, GEFS is produced fo...
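
One common way to turn an ensemble into an uncertainty estimate is to compare members at the same grid point and lead time: the mean gives a central forecast and the standard deviation gives the spread. The sketch below illustrates that idea only. It is not NCEP's processing, the function name is hypothetical, and the temperatures are invented rather than real GEFS output.

```python
from statistics import mean, pstdev


def ensemble_summary(member_values):
    """Return (mean, spread) for one variable at one grid point and lead time."""
    if len(member_values) < 2:
        raise ValueError("need at least two ensemble members")
    # Population standard deviation across members serves as the spread.
    return mean(member_values), pstdev(member_values)


# Invented 2 m temperatures in degrees Celsius from five members.
members = [10.0, 12.0, 14.0, 11.0, 13.0]
center, spread = ensemble_summary(members)
print(f"mean={center:.1f} spread={spread:.2f}")
```

A larger spread means the perturbed members disagree more, which is read as lower confidence in the forecast.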

NOAA Global Forecast System (GFS)
Tags: climate, disaster response, environmental, meteorological, sustainability, weather
The Global Forecast System (GFS) is a weather forecast model produced by the National Centers for Environmental Prediction (NCEP). Dozens of atmospheric and land-soil variables are available through this dataset, from temperatures, winds, and precipitation to soil moisture and atmospheric ozone concentration. The entire globe is covered by the GFS at a base horizontal resolution of 18 miles (28 kilometers) between grid points, which is used by the operational forecasters who predict weather out to 16 days in the future. Horizontal resolution drops to 44 miles (70 kilometers) between grid point...

NOAA Global Hydro Estimator (GHE)
Tags: meteorological, sustainability, water, weather
Global Hydro-Estimator provides global mosaic imagery of rainfall estimates from multiple geostationary satellites, which currently include GOES-16, GOES-15, Meteosat-8, Meteosat-11 and Himawari-8. The GHE products include instantaneous rain rate and 1 hour, 3 hour, 6 hour, 24 hour and multi-day rainfall accumulation.

NOAA High-Resolution Rapid Refresh (HRRR) Model
Tags: climate, disaster response, environmental, sustainability, weather
The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.

NOAA Integrated Surface Database (ISD)
Tags: climate, meteorological, sustainability, weather
The Integrated Surface Database (ISD) consists of global hourly and synoptic observations compiled from numerous sources into a gzipped fixed width format. ISD was developed as a joint activity within Asheville's Federal Climate Complex. The database includes over 35,000 stations worldwide, with some having data as far back as 1901, though the data show a substantial increase in volume in the 1940s and again in the early 1970s. Currently, there are over 14,000 "active" stations updated daily in the database. The total uncompressed data volume is around 600 gigabytes; however, it ...

NOAA National Blend of Models (NBM)
Tags: climate, meteorological, sustainability, weather
The National Blend of Models (NBM) is a nationally consistent and skillful suite of calibrated forecast guidance based on a blend of both NWS and non-NWS numerical weather prediction model data and post-processed model guidance. The goal of the NBM is to create a highly accurate, skillful and consistent starting point for the gridded forecast. The most recent data is under the opnl and expr prefixes. A copy is also placed under the wmo prefix. The wmo prefix is structured like so: wmo/<parameter>/<year>/<month>/<day>/<wmo-file-name> The wmo filename codes can be d...
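
The `wmo/<parameter>/<year>/<month>/<day>/<wmo-file-name>` layout quoted in the NBM description can be assembled from a date and two strings. This is a minimal sketch under stated assumptions: the listing does not say whether month and day are zero-padded, so two-digit padding is a guess, and the function name, parameter value and file name below are hypothetical placeholders rather than real NBM codes.

```python
from datetime import date


def nbm_wmo_key(parameter, day, file_name):
    """Build an object key following the wmo/<parameter>/<year>/<month>/<day>/<file> layout."""
    # Assumption: four-digit year, two-digit month and day.
    return (
        f"wmo/{parameter}/{day.year:04d}/{day.month:02d}/{day.day:02d}/{file_name}"
    )


# Placeholder values only; real parameter names and WMO file names differ.
key = nbm_wmo_key("example-parameter", date(2020, 3, 7), "example-wmo-file")
print(key)
```

Check a few real keys in the bucket before relying on the padding used here.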

NOAA National Water Model Short-Range Forecast
Tags: agriculture, climate, disaster response, environmental, sustainability, transportation, weather
The National Water Model (NWM) is a water resources model that simulates and forecasts water budget variables, including snowpack, evapotranspiration, soil moisture and streamflow, over the entire continental United States (CONUS). The model, launched in August 2016, is designed to improve the ability of NOAA to meet the needs of its stakeholders (forecasters, emergency managers, reservoir operators, first responders, recreationists, farmers, barge operators, and ecosystem and floodplain managers) by providing expanded accuracy, detail, and frequency of water information. It is operated by NOA...

NOAA Space Weather Forecast and Observation Data
Tags: climate, meteorological, solar, sustainability, weather
Space weather forecast and observation data is collected and disseminated by NOAA’s Space Weather Prediction Center (SWPC) in Boulder, CO. SWPC produces forecasts for multiple space weather phenomenon types and the resulting impacts to Earth and human activities. A variety of products are available that provide these forecast expectations, and their respective measurements, in formats that range from detailed technical forecast discussions to NOAA Scale values to simple bulletins that give information in layman's terms. Forecasting is the prediction of future events, based on analysis and...

Nanopore Reference Human Genome
Tags: genomic, life sciences
This dataset includes the sequencing and assembly of a reference standard human genome (GM12878) using the MinION nanopore sequencing instrument with the R9.4 1D chemistry.

Open Observatory of Network Interference
Tags: internet
A free software, global observation network for detecting censorship, surveillance and traffic manipulation on the internet.

OpenNeuro
Tags: biology, imaging, life sciences, neurobiology, neuroimaging
OpenNeuro is a database of openly-available brain imaging data. The data are shared according to a Creative Commons CC0 license, providing a broad range of brain imaging data to researchers and citizen scientists alike. The database primarily focuses on functional magnetic resonance imaging (fMRI) data, but also includes other imaging modalities including structural and diffusion MRI, electroencephalography (EEG), and magnetoencephalography (MEG). OpenfMRI is a project of the Center for Reproducible Neuroscience at Stanford University. Development of the OpenNeuro resource has been funded by th...

OpenStreetMap Linear Referencing
Tags: disaster response, geospatial, osm, sustainability, traffic
OSMLR is a linear referencing system built on top of OpenStreetMap. OSM has great information about roads around the world and their interconnections, but it lacks the means to give a stable identifier to a stretch of roadway. OSMLR provides a stable set of numerical IDs for every 1 kilometer stretch of roadway around the world. In urban areas, OSMLR IDs are attached to each block of roadways between significant intersections.

PROJ datum grids
Tags: geospatial, mapping
Horizontal and vertical adjustment datasets for coordinate transformation to be used by PROJ 7 or later. PROJ is a generic coordinate transformation software that transforms geospatial coordinates from one coordinate reference system (CRS) to another. This includes cartographic projections as well as geodetic transformations.

Physionet
Tags: biology, life sciences
PhysioNet offers free web access to large collections of recorded physiologic signals (PhysioBank) and related open-source software (PhysioToolkit).

Provision of Web-Scale Parallel Corpora for Official European Languages (ParaCrawl)
Tags: machine translation, natural language processing
ParaCrawl is a set of large parallel corpora to/from English for all official EU languages, built by a broad web crawling effort. State-of-the-art methods are applied for the entire processing chain from identifying web sites with translated text all the way to collecting, cleaning and delivering parallel corpora that are ready as training data for CEF.AT and translation memories for DG Translation.

Smithsonian Open Access
Tags: art, culture, encyclopedic, history, museum
The Smithsonian’s mission is the "increase and diffusion of knowledge", and the institution has been collecting since 1846. The Smithsonian, through its efforts to digitize its multidisciplinary collections, has created millions of digital assets and related metadata describing the collection objects. On February 25th, 2020, the Smithsonian released over 2.8 million CC0 interdisciplinary 2-D and 3-D images, related metadata, and additionally, research data from researchers across the Smithsonian. The 2.8 million "open access" collections are a subset of the Smithsonian’s 155 million objects,...

Software Heritage Graph Dataset
Tags: digital preservation, free software, open source software, source code
Software Heritage is the largest existing public archive of software source code and accompanying development history. The Software Heritage Graph Dataset is a fully deduplicated Merkle DAG representation of the Software Heritage archive. The dataset links together file content identifiers, source code directories, Version Control System (VCS) commits tracking evolution over time, up to the full states of VCS repositories as observed by Software Heritage during periodic crawls. The dataset’s contents come from major development forges (including GitHub and GitLab), FOSS distributions (e.g., Deb...
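
Deduplication in a Merkle DAG follows from naming each node by a hash of its content: identical files get the same identifier, and so do directories with identical entries. The sketch below shows that property with a toy in-memory store. It is an illustration only, it does not reproduce Software Heritage's actual identifier scheme, and all function names and the choice of SHA-1 are assumptions.

```python
import hashlib


def node_id(kind, payload):
    """Name a node by hashing its kind and its serialized content."""
    return hashlib.sha1(kind.encode() + b"\0" + payload).hexdigest()


def add_file(store, content):
    """Store file bytes under their content hash and return the identifier."""
    fid = node_id("file", content)
    store[fid] = content  # re-adding identical content overwrites the same key
    return fid


def add_directory(store, entries):
    """Store a directory, given a mapping of entry name to child identifier."""
    # Sorting makes the identifier independent of insertion order.
    lines = [f"{name}\0{child}" for name, child in sorted(entries.items())]
    payload = "\n".join(lines).encode()
    did = node_id("dir", payload)
    store[did] = payload
    return did


store = {}
readme_a = add_file(store, b"hello\n")
readme_b = add_file(store, b"hello\n")  # same bytes, e.g. from another repository
dir_a = add_directory(store, {"README": readme_a})
dir_b = add_directory(store, {"README": readme_b})
print(readme_a == readme_b, dir_a == dir_b, len(store))
```

Because a directory's identifier depends on its children's identifiers, any change to a file changes every ancestor's identifier, while unchanged subtrees are shared.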

Tabula Muris Senis
Tags: biology, encyclopedic, genomic, health, life sciences, machine learning, medicine, single-cell transcriptomics
Tabula Muris Senis is a comprehensive compendium of single cell transcriptomic data from the model organism Mus musculus comprising more than 500,000 cells from 18 organs and tissues across the mouse lifespan. We discovered cell-specific changes occurring across multiple cell types and organs, as well as age related changes in the cellular composition of different organs. Using single-cell transcriptomic data we were able to assess cell type specific manifestations of different hallmarks of aging, such as senescence, changes in the activity of metabolic pathways, depletion of stem-cell populat...

The Genome Modeling System
genetic, genomic, life sciences
The Genome Institute at Washington University has developed a high-throughput, fault-tolerant analysis information management system called the Genome Modeling System (GMS), capable of executing complex, interdependent, and automated genome analysis pipelines at a massive scale. The GMS framework provides detailed tracking of samples and data coupled with reliable and repeatable analysis pipelines. GMS includes a full system image with software and services, expandable from one workstation to a large compute cluster.

The Human Connectome Project
life sciences, neuroimaging
The Human Connectome Project aims to provide an unparalleled compilation of neural data, an interface to graphically navigate this data, and the opportunity to reach never-before-realized conclusions about the living human brain.

The Massively Multilingual Image Dataset (MMID)
computer vision, machine learning, machine translation, natural language processing
MMID is a large-scale, massively multilingual dataset of images paired with the words they represent, collected at the University of Pennsylvania. The dataset is doubly parallel: for each language, words are stored parallel to images that represent the word, and parallel to the word's translation into English (and corresponding images).

The Multilingual Amazon Reviews Corpus
machine learning, natural language processing
We present a collection of Amazon reviews specifically designed to aid research in multilingual text classification. The dataset contains reviews in English, Japanese, German, French, Chinese, and Spanish, collected between November 1, 2015 and November 1, 2019. Each record in the dataset contains the review text, the review title, the star rating, an anonymized reviewer ID, an anonymized product ID, and the coarse-grained product category (e.g., 'books', 'appliances').

stdpopsim species resources
genetic maps, life sciences, population genetics, recombination maps, simulations
Contains all resources (genome specifications, recombination maps, etc.) required for species-specific simulation with the stdpopsim package. These resources originally come from a variety of consortia and published work but are consolidated here for ease of access and use. If you are interested in adding a new species to the stdpopsim resource, please raise an issue on the stdpopsim GitHub page to have the necessary files added here.

MODIS MYD13A1, MOD13A1, MYD11A1, MOD11A1, MCD43A4
disaster response, geospatial, natural resource, satellite imagery, sustainability
Data from the Moderate Resolution Imaging Spectroradiometer (MODIS), managed by the U.S. Geological Survey and NASA. Five products are included: MCD43A4 (MODIS/Terra and Aqua Nadir BRDF-Adjusted Reflectance Daily L3 Global 500 m SIN Grid), MOD11A1 (MODIS/Terra Land Surface Temperature/Emissivity Daily L3 Global 1 km SIN Grid), MYD11A1 (MODIS/Aqua Land Surface Temperature/Emissivity Daily L3 Global 1 km SIN Grid), MOD13A1 (MODIS/Terra Vegetation Indices 16-Day L3 Global 500 m SIN Grid), and MYD13A1 (MODIS/Aqua Vegetation Indices 16-Day L3 Global 500 m SIN Grid). MCD43A4 has global coverage, all...
Usage examples: Astraea Earth OnDemand by Astraea, Inc.

COVID-19 Molecular Structure and Therapeutics Hub
bioinformatics, biology, coronavirus, COVID-19, molecular docking, pharmaceutical
Aggregating critical information to accelerate drug discovery for the molecular modeling and simulation community. A community-driven data repository and curation service for molecular structures, models, therapeutics, and simulations supporting computational research on therapeutic opportunities for COVID-19 (caused by the SARS-CoV-2 coronavirus).

DigitalGlobe Open Data Program
disaster response, earth observation, geospatial, satellite imagery, sustainability
Pre- and post-event high-resolution satellite imagery in support of emergency planning, risk assessment, monitoring of staging areas and emergency response, damage assessment, and recovery. Also includes crowdsourced damage assessments for major, sudden-onset disasters.

Multiview Extended Video with Activities (MEVA)
computer vision, urban, us
The Multiview Extended Video with Activities (MEVA) dataset consists of video data of human activity, both scripted and unscripted, collected with roughly 100 actors over several weeks. The data was collected with 29 cameras with overlapping and non-overlapping fields of view. The current release consists of about 328 hours (516 GB, 4259 clips) of video data, as well as 4.6 hours (26 GB) of UAV data. Other data includes GPS tracks of actors, camera models, and a site map. We have also released annotations for 22 hours of data. Further updates are planned.