Introduction to predictsr

The predictsr package provides access to the PREDICTS database (Hudson et al., 2014) from within R, conveniently as a dataframe. It uses the Natural History Museum Data Portal to download the latest versions of the PREDICTS database and some related metadata.

The PREDICTS database comprises over 4 million measurements of species at sites across the world. Most of the data come from arthropods (mainly insects), and the database covers about 2% of the species known to science. All major terrestrial plant, animal, and fungal groups are covered. There have been two public releases of PREDICTS data: the first, in 2016, contained about 3 million records, and the second, in 2022, about 1 million. We include both, as a single dataset, in this package.

To get started, let’s load the database into R and poke around. To do so, use the GetPredictsData function, which pulls in the data from the data portal. I’ll repeat the default option below, but you can request 2016, 2022, or both as the extracts to fetch. By default it reads both extracts into a single dataframe:

predicts <- predictsr::GetPredictsData(extract = c(2016, 2022))
str(predicts)

## 'data.frame':    4318808 obs. of  67 variables:
##  $ Source_ID                              : Factor w/ 708 levels "AD1_2001__Liow",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Reference                              : Factor w/ 605 levels "Aben et al. 2008",..: 250 250 250 250 250 250 250 250 250 250 ...
##  $ Study_number                           : int  1 1 1 1 1 2 2 2 2 2 ...
##  $ Study_name                             : Factor w/ 888 levels "1 Western Ghat",..: 748 748 748 748 748 749 749 749 749 749 ...
##  $ SS                                     : Factor w/ 993 levels "AD1_2001__Liow 1",..: 1 1 1 1 1 2 2 2 2 2 ...
##  $ Diversity_metric                       : Factor w/ 15 levels "abundance","biomass",..: 1 1 1 1 1 15 15 15 15 15 ...
##  $ Diversity_metric_unit                  : Factor w/ 29 levels "effort-corrected individuals",..: 6 6 6 6 6 18 18 18 18 18 ...
##  $ Diversity_metric_type                  : Factor w/ 3 levels "Abundance","Occurrence",..: 1 1 1 1 1 3 3 3 3 3 ...
##  $ Diversity_metric_is_effort_sensitive   : logi  TRUE TRUE TRUE TRUE TRUE FALSE ...
##  $ Diversity_metric_is_suitable_for_Chao  : logi  TRUE TRUE TRUE TRUE TRUE FALSE ...
##  $ Sampling_method                        : Factor w/ 73 levels "accoustic encounter transect",..: 64 64 64 64 64 64 64 64 64 64 ...
##  $ Sampling_effort_unit                   : Factor w/ 29 levels "catch","core",..: 21 21 21 21 21 21 21 21 21 21 ...
##  $ Study_common_taxon                     : Factor w/ 97 levels "","Acrididae",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Rank_of_study_common_taxon             : Factor w/ 9 levels "","Infraspecies",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ Site_number                            : int  1 2 3 4 5 1 2 3 4 5 ...
##  $ Site_name                              : Factor w/ 33356 levels "0","003_HEM_2",..: 7452 20011 22227 14654 17746 7452 20011 22227 14654 17746 ...
##  $ Block                                  : Factor w/ 2129 levels "","0","01.07.2007",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ SSS                                    : Factor w/ 50032 levels "AD1_2001__Liow 1 1",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ SSB                                    : Factor w/ 6098 levels "AD1_2001__Liow 1 ",..: 1 1 1 1 1 2 2 2 2 2 ...
##  $ SSBS                                   : Factor w/ 50032 levels "AD1_2001__Liow 1  1",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Sample_start_earliest                  : Date, format: "1999-03-03" "1999-02-24" ...
##  $ Sample_end_latest                      : Date, format: "1999-06-30" "1999-06-24" ...
##  $ Sample_midpoint                        : Date, format: "1999-05-01" "1999-04-25" ...
##  $ Sample_date_resolution                 : Factor w/ 3 levels "day","month",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Max_linear_extent_metres               : num  3000 3000 3000 1800 2000 3000 3000 3000 1800 2000 ...
##  $ Habitat_patch_area_square_metres       : num  870000 5210000 7946000 415000 272000 ...
##  $ Sampling_effort                        : num  30 30 30 18 20 30 30 30 18 20 ...
##  $ Rescaled_sampling_effort               : num  1 1 1 0.6 0.667 ...
##  $ Habitat_as_described                   : Factor w/ 3155 levels "","010 WETLAND",..: 2141 2458 2458 2447 2432 2141 2458 2458 2447 2432 ...
##  $ Predominant_land_use                   : Factor w/ 10 levels "Primary vegetation",..: 1 4 4 3 3 1 4 4 3 3 ...
##  $ Source_for_predominant_land_use        : Factor w/ 4 levels "","Direct from publication / author",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Use_intensity                          : Factor w/ 4 levels "Minimal use",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ Km_to_nearest_edge_of_habitat          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Years_since_fragmentation_or_conversion: num  NA 70 70 30 30 NA 70 70 30 30 ...
##  $ Transect_details                       : Factor w/ 506 levels "","100m","100m*2&300m*1",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Coordinates_method                     : Factor w/ 2 levels "Direct from publication / author",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Longitude                              : num  104 104 104 104 104 ...
##  $ Latitude                               : num  1.35 1.35 1.39 1.33 1.28 ...
##  $ Country_distance_metres                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Country                                : Factor w/ 246 levels "Afghanistan",..: 199 199 199 199 199 199 199 199 199 199 ...
##  $ UN_subregion                           : Factor w/ 23 levels "Australia and New Zealand",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ UN_region                              : Factor w/ 6 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Ecoregion_distance_metres              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Ecoregion                              : Factor w/ 814 levels "Admiralty Islands Lowland Rain Forests",..: 548 548 548 548 548 548 548 548 548 548 ...
##  $ Biome                                  : Factor w/ 16 levels "Tundra","Boreal Forests/Taiga",..: 13 13 13 13 13 13 13 13 13 13 ...
##  $ Realm                                  : Factor w/ 8 levels "Afrotropic","Antarctic",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ Hotspot                                : Factor w/ 36 levels "","Atlantic Forest",..: 32 32 32 32 32 32 32 32 32 32 ...
##  $ Wilderness_area                        : Factor w/ 6 levels "","Amazonia",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Taxon_number                           : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Taxon_name_entered                     : Factor w/ 60997 levels "1","10","100",..: 7029 7029 7029 7029 7029 7029 7029 7029 7029 7029 ...
##  $ Indication                             : Factor w/ 2134 levels "","???","aardvark",..: 995 995 995 995 995 995 995 995 995 995 ...
##  $ Parsed_name                            : Factor w/ 48443 levels "","a","A100_Eumenidae",..: 5498 5498 5498 5498 5498 5498 5498 5498 5498 5498 ...
##  $ Taxon                                  : Factor w/ 29585 levels "","Abacoproeces saltuum",..: 2167 2167 2167 2167 2167 2167 2167 2167 2167 2167 ...
##  $ COL_ID                                 : int  13025340 13025340 13025340 13025340 13025340 13025340 13025340 13025340 13025340 13025340 ...
##  $ Name_status                            : Factor w/ 3 levels "","accepted name",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Rank                                   : Factor w/ 9 levels "","Infraspecies",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ Kingdom                                : Factor w/ 5 levels "","Animalia",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Phylum                                 : Factor w/ 14 levels "","Annelida",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Class                                  : Factor w/ 53 levels "","Actinopterygii",..: 25 25 25 25 25 25 25 25 25 25 ...
##  $ Order                                  : Factor w/ 291 levels "","Acarosporales",..: 128 128 128 128 128 128 128 128 128 128 ...
##  $ Family                                 : Factor w/ 1652 levels "","Abrocomidae",..: 98 98 98 98 98 98 98 98 98 98 ...
##  $ Genus                                  : Factor w/ 10606 levels "","Abacoproeces",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Species                                : Factor w/ 14093 levels "","abalinealis",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Best_guess_binomial                    : Factor w/ 36312 levels "","Abacoproeces saltuum",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Higher_taxon                           : Factor w/ 63 levels "","Actinopterygii",..: 26 26 26 26 26 26 26 26 26 26 ...
##  $ Measurement                            : num  42 242 232 111 63 5 22 5 9 10 ...
##  $ Effort_corrected_measurement           : num  42 242 232 185 94.5 5 22 5 9 10 ...

So we can see that there are over 4 million records in the combined PREDICTS extracts. Let’s look at a set of summary statistics for the database:

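# Keep one row per taxon name as entered within each study.
taxa <- predicts[
  !duplicated(predicts[, c("Source_ID", "Study_name", "Taxon_name_entered")]),
]
# Count distinct taxa resolved to species level or below, plus one for each
# remaining entry that could not be resolved that far.
species_counts <- length(
  unique(taxa[taxa$Rank %in% c("Species", "Infraspecies"), "Taxon"])
) + nrow(taxa[!taxa$Rank %in% c("Species", "Infraspecies"), ])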

print(glue::glue(
  "This database has {length(unique(predicts$SS))} studies across ",
  "{length(unique(predicts$SSBS))} sites, in ",
  "{length(unique(predicts$Country))} countries, and with ",
  "{species_counts} species."
))

## This database has 817 studies across 35736 sites, in 101 countries, and with 54863 species.

So that is over 35,000 sites, 101 countries, and 54,863 species in this dataframe! A couple of important columns should be noted. The SS column is what we use to identify studies when dealing with PREDICTS data; it is the concatenation of the Source_ID and Study_number columns. Another important identifier is SSBS, the concatenation of Source_ID, Study_number, Block and Site_number, which is what we use to identify single sites in the database.
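
To make the identifiers concrete, here is a minimal sketch on toy rows rather than the real database. It assumes the fields are joined with a single space, which is what the factor levels printed by str() above suggest (for example "AD1_2001__Liow 1" for SS, and "AD1_2001__Liow 1  1" for SSBS when Block is empty). The `sites` dataframe is made up for illustration.

```r
# Toy rows modelled on the first source shown in the str() output above.
sites <- data.frame(
  Source_ID = c("AD1_2001__Liow", "AD1_2001__Liow"),
  Study_number = c(1L, 2L),
  Block = c("", ""),
  Site_number = c(1L, 1L),
  stringsAsFactors = FALSE
)

# Assumption: the identifier fields are joined with a single space.
sites$SS <- paste(sites$Source_ID, sites$Study_number)
sites$SSBS <- paste(
  sites$Source_ID, sites$Study_number, sites$Block, sites$Site_number
)

print(sites[, c("SS", "SSBS")])
```

Note the double space in SSBS when Block is empty; matching on these strings by hand needs to preserve it.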

Let’s also check the ranges of sample collection in the database:

print(glue::glue(
  "Earliest sample collection (midpoint): {min(predicts$Sample_midpoint)}, ",
  "latest sample collection (midpoint): {max(predicts$Sample_midpoint)}"
))

## Earliest sample collection (midpoint): 1984-04-22, latest sample collection (midpoint): 2018-05-01

Accessing site-level summaries

We also include access to the site-level summaries from the full release; to get these data, use the GetSitelevelSummaries function. The call is very similar to the one above, and pulls in the summaries for the same extracts:

summaries <- predictsr::GetSitelevelSummaries(extract = c(2016, 2022))
str(summaries)

## 'data.frame':    35738 obs. of  50 variables:
##  $ Source_ID                              : Factor w/ 707 levels "AD1_2001__Liow",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Reference                              : Factor w/ 605 levels "Aben et al. 2008",..: 250 250 250 250 250 250 250 250 250 250 ...
##  $ Study_number                           : int  1 1 1 1 1 2 2 2 2 2 ...
##  $ Study_name                             : Factor w/ 888 levels "1 Western Ghat",..: 748 748 748 748 748 749 749 749 749 749 ...
##  $ SS                                     : Factor w/ 993 levels "AD1_2001__Liow 1",..: 1 1 1 1 1 2 2 2 2 2 ...
##  $ Diversity_metric                       : Factor w/ 15 levels "abundance","biomass",..: 1 1 1 1 1 15 15 15 15 15 ...
##  $ Diversity_metric_unit                  : Factor w/ 29 levels "effort-corrected individuals",..: 6 6 6 6 6 18 18 18 18 18 ...
##  $ Diversity_metric_type                  : Factor w/ 3 levels "Abundance","Occurrence",..: 1 1 1 1 1 3 3 3 3 3 ...
##  $ Diversity_metric_is_effort_sensitive   : logi  TRUE TRUE TRUE TRUE TRUE FALSE ...
##  $ Diversity_metric_is_suitable_for_Chao  : logi  TRUE TRUE TRUE TRUE TRUE FALSE ...
##  $ Sampling_method                        : Factor w/ 73 levels "accoustic encounter transect",..: 64 64 64 64 64 64 64 64 64 64 ...
##  $ Sampling_effort_unit                   : Factor w/ 29 levels "catch","core",..: 21 21 21 21 21 21 21 21 21 21 ...
##  $ Study_common_taxon                     : Factor w/ 97 levels "","Acrididae",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Rank_of_study_common_taxon             : Factor w/ 9 levels "","Infraspecies",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ Site_number                            : int  1 2 3 4 5 1 2 3 4 5 ...
##  $ Site_name                              : Factor w/ 33601 levels "0","003_HEM_2",..: 7456 20074 22376 14658 17757 7456 20074 22376 14658 17757 ...
##  $ Block                                  : Factor w/ 2129 levels "","0","01.07.2007",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ SSS                                    : Factor w/ 53008 levels "AD1_2001__Liow 1 1",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ SSB                                    : Factor w/ 6113 levels "AD1_2001__Liow 1 ",..: 1 1 1 1 1 2 2 2 2 2 ...
##  $ SSBS                                   : Factor w/ 53008 levels "AD1_2001__Liow 1  1",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Sample_start_earliest                  : Date, format: "1999-03-03" "1999-02-24" ...
##  $ Sample_end_latest                      : Date, format: "1999-06-30" "1999-06-24" ...
##  $ Sample_midpoint                        : Date, format: "1999-05-01" "1999-04-25" ...
##  $ Sample_date_resolution                 : Factor w/ 3 levels "day","month",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Max_linear_extent_metres               : num  3000 3000 3000 1800 2000 3000 3000 3000 1800 2000 ...
##  $ Habitat_patch_area_square_metres       : num  870000 5210000 7946000 415000 272000 ...
##  $ Sampling_effort                        : num  30 30 30 18 20 30 30 30 18 20 ...
##  $ Rescaled_sampling_effort               : num  1 1 1 0.6 0.667 ...
##  $ Habitat_as_described                   : Factor w/ 3155 levels "","010 WETLAND",..: 2141 2458 2458 2447 2432 2141 2458 2458 2447 2432 ...
##  $ Predominant_land_use                   : Factor w/ 10 levels "Primary vegetation",..: 1 4 4 3 3 1 4 4 3 3 ...
##  $ Source_for_predominant_land_use        : Factor w/ 4 levels "","Direct from publication / author",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Use_intensity                          : Factor w/ 4 levels "Minimal use",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ Km_to_nearest_edge_of_habitat          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Years_since_fragmentation_or_conversion: num  NA 70 70 30 30 NA 70 70 30 30 ...
##  $ Transect_details                       : Factor w/ 506 levels "","100m","100m*2&300m*1",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Coordinates_method                     : Factor w/ 2 levels "Direct from publication / author",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Longitude                              : num  104 104 104 104 104 ...
##  $ Latitude                               : num  1.35 1.35 1.39 1.33 1.28 ...
##  $ Country_distance_metres                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Country                                : Factor w/ 246 levels "Afghanistan",..: 199 199 199 199 199 199 199 199 199 199 ...
##  $ UN_subregion                           : Factor w/ 23 levels "Australia and New Zealand",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ UN_region                              : Factor w/ 6 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Ecoregion_distance_metres              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Ecoregion                              : Factor w/ 814 levels "Admiralty Islands Lowland Rain Forests",..: 548 548 548 548 548 548 548 548 548 548 ...
##  $ Biome                                  : Factor w/ 16 levels "Tundra","Boreal Forests/Taiga",..: 13 13 13 13 13 13 13 13 13 13 ...
##  $ Realm                                  : Factor w/ 8 levels "Afrotropic","Antarctic",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ Hotspot                                : Factor w/ 36 levels "","Atlantic Forest",..: 32 32 32 32 32 32 32 32 32 32 ...
##  $ Wilderness_area                        : Factor w/ 6 levels "","Amazonia",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ N_samples                              : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Higher_taxa                            : chr  "Hymenoptera" "Hymenoptera" "Hymenoptera" "Hymenoptera" ...

Looking more closely at the summary data, we see that a number of columns in predicts are missing from summaries:

print(names(predicts)[!(names(predicts) %in% names(summaries))])

##  [1] "Taxon_number"                 "Taxon_name_entered"          
##  [3] "Indication"                   "Parsed_name"                 
##  [5] "Taxon"                        "COL_ID"                      
##  [7] "Name_status"                  "Rank"                        
##  [9] "Kingdom"                      "Phylum"                      
## [11] "Class"                        "Order"                       
## [13] "Family"                       "Genus"                       
## [15] "Species"                      "Best_guess_binomial"         
## [17] "Higher_taxon"                 "Measurement"                 
## [19] "Effort_corrected_measurement"

This is because all of the measurement-level data have been dropped from the summaries dataframe. We could try to replicate the creation of the summaries dataframe through some {dplyr} operations. Roughly, these would be:

Note: I’m not sure about the summary statistics for this one; there seem to be some differences, and a couple of sites don’t match up.

summaries_rep <- predicts |>
  dplyr::mutate(
    Higher_taxa = paste(sort(unique(Higher_taxon)), collapse = ","),
    N_samples = length(Measurement),
    Rescaled_sampling_effort = mean(Rescaled_sampling_effort),
    .by = SSBS
  ) |>
  dplyr::select(
    dplyr::all_of(names(summaries))
  ) |>
  dplyr::distinct() |>
  dplyr::arrange(SSBS)

summaries_copy <- summaries |>
  subset(SSBS %in% summaries_rep$SSBS) |>
  dplyr::arrange(SSBS)

all.equal(summaries_copy, summaries_rep)

##  [1] "Attributes: < Component \"row.names\": 35719 string mismatches >"                                                            
##  [2] "Component \"Source_ID\": Attributes: < Component \"levels\": Lengths (707, 708) differ (string compare on first 707) >"      
##  [3] "Component \"Study_common_taxon\": 1905 string mismatches"                                                                    
##  [4] "Component \"Rank_of_study_common_taxon\": 1254 string mismatches"                                                            
##  [5] "Component \"Site_name\": Attributes: < Component \"levels\": Lengths (33601, 33356) differ (string compare on first 33356) >"
##  [6] "Component \"Site_name\": Attributes: < Component \"levels\": 27521 string mismatches >"                                      
##  [7] "Component \"SSS\": Attributes: < Component \"levels\": Lengths (53008, 50032) differ (string compare on first 50032) >"      
##  [8] "Component \"SSS\": Attributes: < Component \"levels\": 13977 string mismatches >"                                            
##  [9] "Component \"SSB\": Attributes: < Component \"levels\": Lengths (6113, 6098) differ (string compare on first 6098) >"         
## [10] "Component \"SSB\": Attributes: < Component \"levels\": 460 string mismatches >"                                              
## [11] "Component \"SSBS\": Attributes: < Component \"levels\": Lengths (53008, 50032) differ (string compare on first 50032) >"     
## [12] "Component \"SSBS\": Attributes: < Component \"levels\": 13977 string mismatches >"                                           
## [13] "Component \"Rescaled_sampling_effort\": Mean relative difference: 0.237503"                                                  
## [14] "Component \"Higher_taxa\": 8399 string mismatches"
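
Several of the messages above concern factor levels and row names rather than the values themselves. As a hedged sketch on toy data (the dataframes `a` and `b` and the helper `factors_to_character` are hypothetical, not part of predictsr), converting factors to character and resetting row names before calling all.equal() leaves only genuine value differences to investigate:

```r
# Toy stand-ins for the two dataframes being compared (illustrative values).
a <- data.frame(
  SSBS = factor(
    c("S1 1  1", "S1 1  2", "S2 1  1"),
    levels = c("S1 1  1", "S1 1  2", "S2 1  1", "S3 1  1") # one unused level
  ),
  Rescaled_sampling_effort = c(1, 0.5, 1)
)
b <- data.frame(
  SSBS = factor(c("S1 1  1", "S1 1  2", "S2 1  1")),
  Rescaled_sampling_effort = c(1, 0.5, 1)
)

# An unused factor level alone is enough for all.equal() to report a mismatch.
raw_check <- all.equal(a, b)

# Hypothetical helper: compare values only, ignoring levels and row names.
factors_to_character <- function(df) {
  is_factor <- vapply(df, is.factor, logical(1))
  df[is_factor] <- lapply(df[is_factor], as.character)
  rownames(df) <- NULL
  df
}
clean_check <- all.equal(factors_to_character(a), factors_to_character(b))

print(raw_check)
print(clean_check)
```

This would not explain the differences in Rescaled_sampling_effort or Higher_taxa reported above; it only removes the noise around them.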

Accessing descriptions of the columns in PREDICTS

As we’ve seen already, there are 67 (!) columns in the PREDICTS database. Within the NHM Data Portal releases we have included a description of the data held in each of these columns, which you can access via the GetColumnDescriptions function:

descriptions <- predictsr::GetColumnDescriptions()
str(descriptions)

## 'data.frame':    69 obs. of  8 variables:
##  $ Column                          : chr  "Source_ID" "Reference" "Study_number" "Study_name" ...
##  $ Applies_to                      : chr  "Data Source" "Data Source" "Study" "Study" ...
##  $ Site_extract                    : chr  "Yes" "Yes" "Yes" "Yes" ...
##  $ Diversity_extract               : chr  "Yes" "Yes" "Yes" "Yes" ...
##  $ Type                            : chr  "String" "String" "Integer" "String" ...
##  $ Value_guaranteed_to_be_non_empty: chr  "Yes" "Yes" "Yes" "Yes" ...
##  $ Notes                           : chr  "ID for the Data Source." "Reference for the Data Source in the main text." " " " " ...
##  $ Validation                      : chr  "Unique." " " "Between 1 and n for n Studies within Data Source. Unique within Source_ID." "Unique within Source_ID." ...

So this includes the Column name, the resolution of the Column (Applies_to), whether it is in the PREDICTS extract (Diversity_extract), whether it is in the site-level summaries (Site_extract), the datatype of the Column (Type), whether it is guaranteed to be non-empty (Value_guaranteed_to_be_non_empty), any additional Notes, and any information on the range of values that the Column may be expected to take (Validation).
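
As a small sketch of how these flags might be used, the snippet below picks out columns that appear only in the site-level summaries. It runs on a toy stand-in for the descriptions dataframe; the rows are illustrative, and treating any value other than "Yes" as "not present" is an assumption, since only "Yes" is visible in the output above.

```r
# Toy stand-in for the `descriptions` dataframe (rows are illustrative).
descriptions <- data.frame(
  Column = c("Source_ID", "SSBS", "Taxon", "N_samples"),
  Site_extract = c("Yes", "Yes", "No", "Yes"),
  Diversity_extract = c("Yes", "Yes", "Yes", "No"),
  stringsAsFactors = FALSE
)

# The flags are stored as strings, so compare against "Yes" explicitly.
site_only <- descriptions$Column[
  descriptions$Site_extract == "Yes" & descriptions$Diversity_extract != "Yes"
]

print(site_only)
```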

Glossary

To clarify some of the PREDICTS jargon, we include the following table of definitions, from the dataframes we have worked with thus far:

Source_ID (String): ID for the Data Source.
SS (String): Concatenation of Source_ID and Study_number.
Diversity_metric_type (String): One of:
- Abundance
- Occurrence
- Species richness
Block (Integer): Within a Study either:
- Empty for all Sites
- Non-empty for all Sites and at least two different values among Sites
SSBS (String): Concatenation of Source_ID, Study_number, Block and Site_number.
Measurement (Number): The biodiversity measurement of the Taxon at the Site in the Study, in units of Diversity_metric_unit.
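
The Block rule in the glossary can be read as a per-study check. The following is a rough sketch on toy data, where the `sites` dataframe and the helper `block_rule_ok` are hypothetical and empty strings and NA are both treated as "empty":

```r
# Toy data: three hypothetical studies, not real PREDICTS records.
sites <- data.frame(
  SS = c("A 1", "A 1", "B 1", "B 1", "C 1", "C 1"),
  Block = c("", "", "north", "south", "north", ""),
  stringsAsFactors = FALSE
)

# TRUE when the Block values of one study follow the glossary rule:
# either all empty, or none empty with at least two distinct values.
block_rule_ok <- function(block) {
  empty <- is.na(block) | block == ""
  all(empty) || (!any(empty) && length(unique(block)) >= 2)
}

# Apply the check study by study.
block_ok <- vapply(split(sites$Block, sites$SS), block_rule_ok, logical(1))

print(block_ok)
```

Here study "C 1" fails the check because it mixes an empty Block with a non-empty one.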

Notes: When we refer to a “study”, we typically identify it by its SS value. An “extract” is simply a PREDICTS database release, either from 2016 or 2022 (or both). For complete documentation see the SI in Hudson et al. (2017).

Notes

The NHM Data Portal API has no rate limits, so be considerate with your requests. Make sure you save the data somewhere, or use a tool like targets to save you from re-running workflows.
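
One simple way to save the data is to cache it on disk with base R. The sketch below uses a hypothetical helper, `read_or_fetch`, and a stand-in fetch function so that it runs without touching the data portal; it is one possible approach, not something predictsr provides.

```r
# Hypothetical helper: reuse a saved copy if present, otherwise fetch and save.
read_or_fetch <- function(path, fetch) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  data <- fetch()
  saveRDS(data, path)
  data
}

# Stand-in fetch function that counts how often it is actually called.
calls <- 0
fake_fetch <- function() {
  calls <<- calls + 1
  data.frame(SS = c("A 1", "B 1"))
}

cache_path <- tempfile(fileext = ".rds")
first <- read_or_fetch(cache_path, fake_fetch)  # fetches and writes the file
second <- read_or_fetch(cache_path, fake_fetch) # reads the saved copy

print(calls)
```

In real use you would pass something like `function() predictsr::GetPredictsData(extract = c(2016, 2022))` as the fetch argument, with a path of your choosing instead of a temporary file.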

References

Hudson, Lawrence N., et al. “The PREDICTS database: a global database of how local terrestrial biodiversity responds to human impacts.” Ecology and Evolution 4.24 (2014): 4701-4735. <doi.org/10.1002/ece3.1303>.

Hudson, Lawrence N., et al. “The database of the PREDICTS (projecting responses of ecological diversity in changing terrestrial systems) project.” Ecology and Evolution 7.1 (2017): 145-188. <doi.org/10.1002/ece3.2579>.