Dataset Schema Generator

This free tool generates JSON-LD structured data for datasets using Schema.org's Dataset type. Enter your dataset's name, description, variables, temporal and geographic coverage, publisher, license, and download format, and the generator builds markup that helps make your data discoverable through Google Dataset Search and standard web search. Get your data in front of the researchers, analysts, journalists, and developers who need it.

Dataset Details
Name: recommended length 50-110 characters
Description: recommended length 50-5000 characters
Tip: Be as specific as possible with the name. "US Monthly Employment Statistics by State, 2000-2025" is far more discoverable than "Employment Data."
Creator & Publisher
Dates
Coverage
Tip: Geographic filtering is one of the primary discovery mechanisms in Google Dataset Search. Always include spatialCoverage for location-specific data.
License
Variables Measured (Optional)
Tip: List the specific measurements in your dataset, such as "daily maximum temperature," "unemployment rate," or "GDP growth." This makes your data findable for variable-specific searches.
Distribution / Downloads (Recommended)
Tip: Always include at least one distribution with a download URL and format. Users searching for data often filter by format. A single dataset can have multiple distributions (e.g. CSV, JSON, and Excel).
Keywords (Optional)
Additional Details (Optional)

What Is Dataset Schema?

Dataset schema is structured data that describes a collection of data published on the web. It uses Schema.org's Dataset type to define what the data contains, who published it, when it was created, what time period and geography it covers, how it's licensed, and where to download it. The markup gives search engines a machine-readable catalog entry for your dataset, turning a file sitting on a server into a discoverable resource.

The concept borrows from library science and data cataloging. Just as a library catalog tells you what a book is about without requiring you to read it, Dataset schema tells search engines what your data contains without requiring them to parse the files. A CSV with ten million rows of climate measurements is opaque to a search crawler. Dataset schema on that CSV's landing page explains that it contains daily temperature readings from 500 weather stations across North America from 1970 to 2024, published by NOAA, available under a Creative Commons license.
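
As a rough illustration, the catalog entry for the climate example above might look like the following when built in Python and serialized to JSON-LD. The dataset name, landing-page URL, exact date range, and specific Creative Commons license are placeholders invented for this sketch, not values from a real NOAA dataset.

```python
import json

# Hypothetical values for illustration; only the property names come from Schema.org.
dataset = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Daily Temperature Readings from 500 North American Weather Stations, 1970-2024",
    "description": (
        "Daily temperature readings from 500 weather stations across "
        "North America, covering 1970 through 2024."
    ),
    "url": "https://example.org/datasets/daily-temperature",  # placeholder landing page
    "creator": {"@type": "Organization", "name": "NOAA"},
    "license": "https://creativecommons.org/licenses/by/4.0/",  # assumed CC variant
    "temporalCoverage": "1970-01-01/2024-12-31",  # assumed exact start and end dates
}

# Serialize to the JSON-LD text that goes in the page's markup.
json_ld = json.dumps(dataset, indent=2)
print(json_ld)
```

On a real page, the serialized JSON would sit inside a script element of type application/ld+json on the dataset's landing page.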

Google launched Dataset Search in 2018 as a search engine dedicated to finding datasets. It relies heavily on Dataset schema to populate its index. Standard Google Search also processes Dataset schema and may display dataset-specific rich results for relevant queries. The markup is the primary mechanism by which your data becomes findable through both search products.

Who Publishes Datasets?

Dataset schema isn't just for academic researchers and government agencies, though those are the most obvious publishers. Any organization that makes structured data available to the public or to specific audiences benefits from marking it up.

  • Government and public sector. Census data, economic indicators, environmental monitoring, health statistics, crime data, transportation records, and regulatory filings.
  • Academic and research institutions. Study results, experimental data, survey responses, genome sequences, astronomical observations, and any data that supports published research.
  • News organizations. Data journalism relies on datasets, and many newsrooms publish the data underlying their reporting. Election results, economic analysis, public records compilations, and investigative databases.
  • Businesses and SaaS companies. Market research firms publishing industry benchmarks, financial data providers offering free tiers of their data, and any company that publishes data as a content marketing strategy.
  • Nonprofits and NGOs. Humanitarian data, conservation monitoring, public health surveillance, and development indicators.
  • Open data communities. Contributors to open data platforms, citizen science projects, and crowdsourced data collection efforts.

What Makes Google Dataset Search Different?

Google Dataset Search is a specialized vertical search engine, separate from standard Google Search, that indexes only datasets. It launched as a research project and has become a widely used discovery tool for finding data across the web.

  • Dedicated interface. Dataset Search has its own interface at datasetsearch.research.google.com with filters designed for data discovery: filtering by update date, file format, usage rights, and topic.
  • Schema-dependent indexing. While standard Google Search can index pages based on their text content regardless of schema, Dataset Search relies almost entirely on structured data to understand what a dataset contains. Without Dataset schema, your data pages are effectively invisible to Dataset Search.
  • Aggregation across repositories. Dataset Search indexes datasets from institutional repositories like Zenodo, Figshare, and Dryad alongside datasets hosted on individual websites, government portals, and data platforms.
  • Citation and reuse tracking. Dataset Search surfaces information about how many other datasets or publications reference a given dataset. Well-described datasets with clear provenance and licensing tend to get cited more.

What Properties Does the Generator Include?

The generator covers all properties Google recommends for Dataset schema, organized from essential to enriching.

  • name. The dataset's title. Be descriptive and specific so someone can evaluate relevance before reading the full description.
  • description. A detailed summary of what the dataset contains, how it was collected, what variables are included, and what time period and geography it covers. This is the single most important property for discoverability.
  • url. The canonical URL of the landing page where the dataset is described and accessed.
  • creator / publisher. The person or organization that created the dataset, and the entity that makes it available.
  • datePublished / dateModified. When the dataset was first published and when it was last updated.
  • license. The usage license. Usage rights are one of the filters available in Dataset Search.
  • distribution. How the dataset can be accessed. Each distribution specifies file format, download URL, and content type.
  • temporalCoverage. The time period the data spans, using ISO 8601 notation (for example, "2000-01-01/2025-12-31").
  • spatialCoverage. The geographic area the data covers.
  • variableMeasured. The specific variables or measurements contained in the dataset.
  • keywords. Topic keywords that supplement the description for discovery.
  • measurementTechnique. How the data was collected.
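
The coverage and variable properties are the ones most often hand-formatted incorrectly, so a few small helpers can be useful. This is a minimal sketch: the function names are hypothetical, the sample values are placeholders, and the open-ended ".." form assumes the dataset is still being extended.

```python
from datetime import date
from typing import Optional


def temporal_coverage(start: date, end: Optional[date] = None) -> str:
    """Format an ISO 8601 interval; '..' marks an open-ended (ongoing) range."""
    end_part = end.isoformat() if end else ".."
    return f"{start.isoformat()}/{end_part}"


def spatial_coverage(place_name: str) -> dict:
    """Describe the covered area as a named Place."""
    return {"@type": "Place", "name": place_name}


def variables_measured(names: list[str]) -> list[dict]:
    """Wrap each variable name in a PropertyValue object."""
    return [{"@type": "PropertyValue", "name": name} for name in names]


# Example usage with placeholder values.
coverage = {
    "temporalCoverage": temporal_coverage(date(2000, 1, 1), date(2025, 12, 31)),
    "spatialCoverage": spatial_coverage("United States"),
    "variableMeasured": variables_measured(["unemployment rate"]),
}
print(coverage)
```

These fragments can be merged into the main Dataset object alongside name, description, and the other properties listed above.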

Can I Use This for API-Accessible Data?

Yes. Datasets don't have to be downloadable files. If your data is accessible through an API, you can still describe it with Dataset schema and make it discoverable through Dataset Search.

For API-accessible datasets, the distribution property uses a DataDownload entry with the contentUrl pointing to the API endpoint or documentation page, and the encodingFormat indicating the response format (typically "application/json" or "text/csv" depending on the API output).

You can include both API access and file downloads as separate distribution entries within the same Dataset schema. This gives users the option of downloading a static snapshot or accessing real-time data through the API, all described within a single schema block.

For datasets that are exclusively API-accessible with no static download, make sure the landing page explains how to access the API, what authentication is required, what endpoints are available, and what rate limits apply.
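
To make the API case concrete, here is a sketch of a distribution list that offers both a static CSV snapshot and an API endpoint. The helper name and both URLs are hypothetical placeholders.

```python
def data_download(content_url: str, encoding_format: str) -> dict:
    """Build one DataDownload entry for the distribution property."""
    return {
        "@type": "DataDownload",
        "contentUrl": content_url,
        "encodingFormat": encoding_format,
    }


# Placeholder URLs: one static file and one API endpoint for the same dataset.
distribution = [
    data_download("https://example.org/data/employment.csv", "text/csv"),
    data_download("https://api.example.org/v1/employment", "application/json"),
]

dataset_fragment = {"@type": "Dataset", "distribution": distribution}
print(dataset_fragment)
```

Keeping both entries in one list means a single schema block describes every access path.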

How Do I Handle Dataset Updates?

Many datasets are updated regularly, whether daily, monthly, quarterly, or annually. How you handle updates in your schema affects both discoverability and user trust.

  • Update dateModified with each release. Every time you publish new data, update the dateModified property. This tells search engines and users that the dataset is actively maintained.
  • Use temporalCoverage to reflect the current range. As your dataset grows with new data, expand the temporalCoverage end date to reflect the latest data available.
  • Version management. If you publish distinct versions, consider whether each version should have its own Dataset schema or whether the schema should always describe the current version.
  • Changelog documentation. If your updates change the dataset's structure, add new variables, remove variables, or change collection methodology, document these changes on the landing page.
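
The first two points above can be automated as part of a release script. The sketch below uses a hypothetical helper, record_release, and assumes temporalCoverage is stored as a "start/end" string and that each release only extends the range forward.

```python
import copy
from datetime import date


def record_release(dataset: dict, release_date: date, data_through: date) -> dict:
    """Return a copy with dateModified set and the temporalCoverage end moved forward."""
    updated = copy.deepcopy(dataset)  # leave the caller's dict untouched
    updated["dateModified"] = release_date.isoformat()
    # Keep the original start; replace only the end of the interval.
    start = updated["temporalCoverage"].split("/")[0]
    updated["temporalCoverage"] = f"{start}/{data_through.isoformat()}"
    return updated


# Placeholder dataset state before a monthly release.
current = {
    "@type": "Dataset",
    "dateModified": "2025-07-15",
    "temporalCoverage": "2000-01-01/2025-06-30",
}
released = record_release(current, date(2025, 8, 15), date(2025, 7, 31))
print(released["dateModified"], released["temporalCoverage"])
```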

Common Dataset Schema Mistakes to Avoid

  • Writing a vague description. "This dataset contains economic data" tells Google nothing useful. A description needs specifics: what variables, what geography, what time period, what methodology, what format.
  • Omitting the license. Datasets without a declared license are generally treated as "all rights reserved," and cautious users will avoid them. If you intend for your data to be used, specify a license.
  • Missing distribution information. A Dataset schema that describes the data but doesn't tell anyone how to access it is a catalog entry for a book that's not in the library.
  • Using Dataset schema for things that aren't datasets. A table in a blog post or a chart in a report doesn't make the page a dataset. Dataset schema should be used for actual data collections that can be downloaded or queried.
  • Not specifying the file format. Users often filter by format. Declare the exact format in each distribution entry.
  • Forgetting spatialCoverage on geographically bounded data. If your dataset covers a specific country, state, city, or region, include the spatialCoverage property.
  • Neglecting updates after the initial publish. If the dataset is actively maintained, the schema should reflect that ongoing activity.
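
Several of these mistakes can be caught mechanically before publishing. The following is a rough pre-publish check, not an official validator: the function name lint_dataset is hypothetical, the 50-5000 character range mirrors the description length this tool recommends, and the checks cover only the structural mistakes listed above.

```python
def lint_dataset(dataset: dict) -> list[str]:
    """Return a list of human-readable warnings for common Dataset schema mistakes."""
    problems = []

    description = dataset.get("description", "")
    if not 50 <= len(description) <= 5000:
        problems.append("description should be 50-5000 characters")

    if not dataset.get("license"):
        problems.append("license is missing")

    distributions = dataset.get("distribution") or []
    if isinstance(distributions, dict):  # a single entry may be given as one object
        distributions = [distributions]
    if not distributions:
        problems.append("distribution is missing")
    for index, entry in enumerate(distributions):
        if not entry.get("contentUrl"):
            problems.append(f"distribution[{index}] has no contentUrl")
        if not entry.get("encodingFormat"):
            problems.append(f"distribution[{index}] has no encodingFormat")

    if not dataset.get("spatialCoverage"):
        problems.append("spatialCoverage is missing (fine for non-geographic data)")

    return problems


# The vague description from the first bullet fails several checks at once.
warnings = lint_dataset({"description": "This dataset contains economic data"})
for warning in warnings:
    print(warning)
```

A check like this cannot judge whether a description is specific enough, so it complements a manual review and a pass through a structured data testing tool.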

How Should I Structure the Landing Page?

The page where your dataset lives matters as much as the schema that describes it. Google evaluates both the structured data and the page content.

  • Make the dataset the focus. The landing page should be primarily about the dataset, not a blog post that mentions data in passing.
  • Provide direct access to the data. Put download links prominently on the landing page. If access requires registration, make the process clear upfront.
  • Document the variables. A data dictionary or variable codebook on the landing page dramatically increases the dataset's usability and provides keyword-rich content.
  • Include a data preview. Showing the first few rows or summary statistics lets users evaluate whether the dataset matches their needs without downloading the full file.
  • State the license clearly. Display the license type prominently with a link to the full license text.
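
For the data preview, one simple approach is to read only the header and the first few rows of the file when the page is built. This is a minimal sketch using Python's standard csv module; the function name preview_rows and the sample rows are made up for illustration.

```python
import csv
import io
from itertools import islice


def preview_rows(csv_text: str, limit: int = 5) -> list[list[str]]:
    """Return the header row plus up to `limit` data rows."""
    reader = csv.reader(io.StringIO(csv_text))
    return list(islice(reader, limit + 1))  # +1 accounts for the header


# Made-up sample content standing in for a real dataset file.
sample = (
    "state,month,unemployment_rate\n"
    "Ohio,2025-01,4.1\n"
    "Texas,2025-01,3.9\n"
)
for row in preview_rows(sample, limit=1):
    print(row)
```

Because islice stops early, the same function works on large files without reading every row.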
