Last updated: 2023-12-20
Checks: 7 passed, 0 failed
Knit directory: workflowr-policy-landscape/
This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.
The command set.seed(20220505) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version a30bb03. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .DS_Store
Ignored: .RData
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: data/.DS_Store
Ignored: data/original_dataset_reproducibility_check/.DS_Store
Ignored: output/.DS_Store
Ignored: output/Figure_3B/.DS_Store
Ignored: output/created_datasets/.DS_Store
Untracked files:
Untracked: gutenbergr_0.2.3.tar.gz
Unstaged changes:
Modified: Policy_landscape_workflowr.R
Modified: data/original_dataset_reproducibility_check/original_cleaned_data.csv
Modified: data/original_dataset_reproducibility_check/original_dataset_words_stm_5topics.csv
Modified: output/Figure_3A/Figure_3A.png
Modified: output/created_datasets/cleaned_data.csv
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were made to the R Markdown (analysis/1a_Data_preprocessing.Rmd) and HTML (docs/1a_Data_preprocessing.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.
File | Version | Author | Date | Message |
---|---|---|---|---|
html | 5c836ab | zuzannazagrodzka | 2023-12-07 | Build site. |
html | c494066 | zuzannazagrodzka | 2023-12-02 | Build site. |
Rmd | 627fdff | zuzannazagrodzka | 2023-12-01 | correction number documents |
Rmd | 926693b | zuzannazagrodzka | 2023-12-01 | correction |
html | 8b3a598 | zuzannazagrodzka | 2023-11-10 | Build site. |
html | 729fc52 | zuzannazagrodzka | 2023-11-10 | Build site. |
html | a66c8a9 | zuzannazagrodzka | 2023-11-09 | Build site. |
Rmd | e8c5afe | zuzannazagrodzka | 2023-11-09 | wflow_publish(c("./analysis/ListMissionVision.Rmd", "./analysis/1b_Dictionaries_preparation.Rmd", |
Rmd | 41dd1ca | Thomas Frederick Johnson | 2022-11-25 | Revisions to the text, and pushing the write thing this time… |
html | 5bdfc2a | Andrew Beckerman | 2022-11-24 | Build site. |
html | 34ddc80 | Andrew Beckerman | 2022-11-24 | Build site. |
html | 93838e7 | Andrew Beckerman | 2022-11-24 | fixing paths in index |
html | a3f02dc | Andrew Beckerman | 2022-11-24 | fixing paths in index |
html | 693000e | Andrew Beckerman | 2022-11-24 | Build site. |
html | 60a6c61 | Andrew Beckerman | 2022-11-24 | Build site. |
html | fb90a00 | Andrew Beckerman | 2022-11-24 | Build site. |
Rmd | e08d7ac | Andrew Beckerman | 2022-11-24 | more organising and editing of workflowR mappings |
Rmd | 31239cd | Andrew Beckerman | 2022-11-24 | more organising and editing of workflowR mappings |
Rmd | c95aa82 | Andrew Beckerman | 2022-11-10 | updating pre-processing mission html for workflowr |
html | c95aa82 | Andrew Beckerman | 2022-11-10 | updating pre-processing mission html for workflowr |
html | 0a21152 | zuzannazagrodzka | 2022-09-21 | Build site. |
html | 796aa8e | zuzannazagrodzka | 2022-09-21 | Build site. |
html | 91d5fb6 | zuzannazagrodzka | 2022-09-20 | Build site. |
Rmd | e8852f1 | zuzannazagrodzka | 2022-09-20 | wflow_publish(c("analysis/1a_Data_preprocessing.Rmd", "analysis/1b_Dictionaries_preparation.Rmd")) |
We collected 129 mission and aim statements from six stakeholder groups involved in the ecology and evolutionary biology research landscape.
We used the Scimago Journal & Country Rank website (https://www.scimagojr.com/) to search for the journals with the highest impact value in 2020 (subject areas: Environmental Science; Agricultural and Biological Sciences; Biochemistry, Genetics and Molecular Biology), all of which publish ecology and evolutionary biology research. In total, we identified 14 open access (OA) journals and 16 non-open access (non-OA) journals. We also included some journals that we were already aware of but that were not on the list. This collection of journals included both learned-society and non-society journals.
We identified publishers as the owners or production units of these journals.
To find funders, we searched the “Acknowledgments” sections of scientific articles published in 2019 and 2020 in high-impact-factor journals (OA and non-OA). We focused on finding funders from all continents, with a limit of three national funders per country. We also contacted colleagues at colleges and universities outside of the UK for information on the funding sources in their countries.
We looked at the data availability statements of articles published in 2019 and 2020 in high-impact-factor journals (OA and non-OA) and collected information on where the data and code were archived. Our list includes both generalist and subject-specific repositories.
We identified societies based on the journals they own and from prior experience.
Advocates are a group of organisations that actively support or promote good-quality and accessible research (open research). We considered different aspects of open research (open access, open data, open methods) when looking for these advocacy organisations. Most advocates do not exclusively support research in ecology and evolutionary biology.
In August 2021 we collected the aims and mission statements from the official website of each stakeholder. We did not contact anyone associated with the stakeholders to request more information. If there was no separate section for the aim or mission statement, but text resembling these statements was contained within an “About” section, this was deemed acceptable. The text from these websites was manually copied and saved separately for each stakeholder (see the list of organisations). The first line in each document is the source website.
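For illustration, a saved statement file (hypothetical contents, not a real file from the dataset) follows the NameOfStakeholder_DocumentType naming convention described below, with the source website on the first line and the copied text after it:
# Hypothetical contents of Dryad_About.txt (illustrative only)
https://www.example-stakeholder.org/about
We aim to ... (the copied aim/mission statement text)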
To analyse the content of the statements, we first preprocessed the documents following the cleaning process suggested in Maier et al. 2018 “Applying LDA Topic Modelling in Communication Research: Toward a Valid and Reliable Methodology”:
Importing all documents and converting them into a table with the following columns:
- name: name of the stakeholder
- filename: name of the file (NameOfStakeholder_DocumentType)
- stakeholder: stakeholder group (here: advocates, funders, journals, for-profit publishers, not-for-profit publishers, repositories, societies)
- txt: text of the statement
- doc_type: type of the document (Mission Statement or About)
Removing link formatting from the text (http:// and https:// links)
Separating text into sentences and keeping information on what document and stakeholder they belong to.
Tokenisation - creating a tidy text, converting tokens to lowercase, removing punctuation, deleting special characters
Removing stop words; for this we used the SMART and snowball lexicons from the stop_words dataset (library tidytext), and we also removed other uninformative words such as numbering (ii, iii, iv, v), document-type names (aim, aims, mission…), and stakeholder names (erc, nerc, wellcome)
Lemmatization (library lexicon) - converting words to their lemma form/lexeme (e.g., “contaminating” and “contamination” become “contaminate”) (Manning & Schütze, 2003, p. 132).
Because we worked with a relatively small number of documents, we did not perform relative pruning (stripping very rare and extremely frequent word occurrences from the observed data); had we pruned, it would have looked something like the sketch below.
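A minimal quanteda sketch of relative pruning (illustrative only; this step was not run in our pipeline, and the thresholds are hypothetical):
# Illustrative only - relative pruning with quanteda (loaded later in this page)
toy_dfm <- dfm(tokens(c("open research data", "open data policy", "archive data")))
dfm_trim(toy_dfm,
         min_docfreq = 0.05,   # drop words occurring in fewer than 5% of documents
         max_docfreq = 0.95,   # drop words occurring in more than 95% of documents
         docfreq_type = "prop")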
Cleaning environment and loading R packages
rm(list=ls())
library(tidyverse)
library(purrr)
library(tidyr)
library(stringr)
library(tidytext)
# Additional libraries
library(quanteda)
Warning in .recacheSubclasses(def@className, def, env): undefined subclass
"pcorMatrix" of class "replValueSp"; definition not updated
Warning in .recacheSubclasses(def@className, def, env): undefined subclass
"pcorMatrix" of class "xMatrix"; definition not updated
Warning in .recacheSubclasses(def@className, def, env): undefined subclass
"pcorMatrix" of class "mMatrix"; definition not updated
library(quanteda.textplots)
library(quanteda.dictionaries)
library(tm)
library(topicmodels)
library(ggplot2)
library(dplyr)
library(wordcloud)
library(reshape2)
library(igraph)
library(ggraph)
library(stm)
library("kableExtra") # to create a table when converting to html
Importing stakeholder statements (.txt format), compiling them into a list, and converting this list into a corpus
dirs <- list.dirs(path = "./data/mission_statements", recursive = FALSE)
getwd()
[1] "/Users/zuzannazagrodzka/Library/CloudStorage/GoogleDrive-z.zagrodzka@sheffield.ac.uk/My Drive/PhD_folder_laptop/11_2023_Accelerating_OR_agenda/workflowr-policy-landscape"
# List of files
files <- list()
for (i in 1:length(dirs)){
  files[[i]] <- list.files(path = dirs[i],
                           pattern = ".txt",
                           full.names = TRUE,
                           recursive = FALSE)
}
# files
use_files <- unlist(files)
# use_files
# using purrr to generate a data frame of the corpus texts
corpus_df <- map_df(use_files,
~ data_frame(txt = read_file(.x)) %>%
mutate(filename = basename(.x)))
Warning: `data_frame()` was deprecated in tibble 1.1.0.
ℹ Please use `tibble()` instead.
This warning is displayed once every 8 hours.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
generated.
corpus_df$txt <- iconv(corpus_df$txt, from = "ISO-8859-1", to = "UTF-8")
# removing encoded junk from the text column
corpus_df$txt <- gsub("[^[:print:]]", " ", corpus_df$txt)
Adding metadata to the corpus to clarify which stakeholder and stakeholder group each statement belongs to
# create new columns: name, stakeholder
corpus_df$name <- corpus_df$filename
corpus_df <- corpus_df %>% separate(name, c("name","doc_type"), sep = "_")
corpus_df <- corpus_df %>% mutate_at("doc_type", str_replace, ".txt", "")
# creating a column: stakeholder
corpus_df$stakeholder <- corpus_df$name
# filling stakeholder column with the stakeholders' names
# Funders
corpus_df$stakeholder[corpus_df$stakeholder%in% c("CNPq", "Alexander von Humboldt Foundation", "Australian Research Council", "Chinese Academy of Sciences", "Conacyt", "CONICYT", "Consortium of African Funds for the Environment", "Coordenacao de Aperfeicoamento de Pessoal de Nivel Superior", "CSIR South Africa", "Deutsche Forschungsgemeinschaft", "ERC", "FORMAS", "French National Centre for Scientific Research", "Helmholtz-Gemeinschaft", "JST", "Max Planck Society", "MOE China", "National Natural Science Foundation", "National Research Council Italy", "National Science Foundation", "NERC", "NRC Egypt", "NRF South Africa", "NSERC", "RSPB", "Russian Academy of Science", "Sea World Research and Rescue Foundation", "Spanish National Research Council", "The Daimler and Benz Foundation", "The French National Research Agency", "Wellcome")] <- "funders"
# Journals OA
corpus_df$stakeholder[corpus_df$stakeholder%in% c("Arctic, Antarctic, and Alpine Research", "Biogeosciences","Conservation Letters", "Diversity and Distributions", "Ecology and Evolution", "Ecology and Society", "eLifeJournal", "Evolution Letters", "Evolutionary Applications", "Frontiers in Ecology and Evolution", "Neobiota", "PeerJJournal", "Plos Biology", "Remote Sensing in Ecology and Conservation")] <- "journals_OA"
# Journals nonOA (including transitioning, hybrid and closed - last time checked August 2021)
corpus_df$stakeholder[corpus_df$stakeholder%in% c("BioSciences", "American Naturalist", "Annual Review of Ecology Evolution and Systematics", "Biological Conservation", "Conservation Biology", "Ecological Applications", "Ecology Letters", "Ecology", "Evolution", "Frontiers in Ecology and the Environment", "Global Change Biology", "Journal of Applied Ecology", "Nature Ecology and Evolution", "Philosophical Transactions of the Royal Society B", "Proceedings of the Royal Society B Biological Sciences", "Trends in Ecology & Evolution")] <- "journals_nonOA"
# Societies
corpus_df$stakeholder[corpus_df$stakeholder%in% c("BES", "ESEB", "RS", "SORTEE", "The Society for Conservation Biology", "The Zoological Society of London", "Society for the Study of Evolution", "Max Planck Society", "American Society of Naturalists", "British Ecological Society", "Ecological Society of America", "European Society for Evolutionary Biology", "National Academy of Sciences", "Australasian Evolution Society", "Ecological Society of Australia", "Royal Society Te Aparangi", "The Royal Society")] <- "societies"
# Repositories
corpus_df$stakeholder[corpus_df$stakeholder%in% c("Australian Antarctic Data Centre", "BCO-DMO", "DNA Databank of Japan", "Dryad", "European Bioinformatics Institute", "Figshare", "GBIF", "Harvard Dataverse", "KNB", "Marine Data Archive", "NCBI", "TERN", "World Data Center for Climate", "Zenodo", "EcoEvoRxiv", "bioRxiv", "OSF")] <- "repositories"
# Publishers: not-for-profit and for-profit
corpus_df$stakeholder[corpus_df$stakeholder%in% c("The University of Chicago Press", "Annual Reviews", "BioOne", "eLife", "Frontiers", "PLOS", "Resilience Alliance", "The Royal Society Publishing", "AIBS")] <- "publishers_nonProfit"
corpus_df$stakeholder[corpus_df$stakeholder%in% c("Cell Press", "Elsevier", "Springer Nature", "PeerJ", "Pensoft", "Wiley")] <- "publishers_Profit"
# Advocates - stakeholders promoting good research practices and Open Research agenda
corpus_df$stakeholder[corpus_df$stakeholder%in% c("Center for Open Science", "coalitionS", "CoData", "DataCite", "DOAJ", "Gitlab", "Peer Community In", "RDA", "Research Data Canada", "Africa Open Science and Hardware", "Amelica", "Bioline International", "Coko", "COPDESS", "FAIRsharing" , "FORCE11", "FOSTER" , "Free our knowledge", "Jisc", "Open Access Australasia", "Reference Center for Environmental Information", "Research4life" , "ROpenSci" , "SPARC" )] <- "advocates"
Creating corpus_df_website_info, which will be used later to get a list of the source websites
corpus_df_website_info <- corpus_df
Cleaning and lemmatising the text, and removing all stakeholder names
# Cleaning the text from http:// and https:// links, removing numbers and "'s"
# remove http:// and https:// and www.
corpus_df$txt <- gsub("(s?)(f|ht)tp(s?)://\\S+\\b", " ", corpus_df$txt, useBytes = TRUE)
corpus_df$txt <- gsub("www.\\S+\\s*", "", corpus_df$txt, useBytes = TRUE)
# removing full names and phrases before tokenisation:
# change oa to open access and or to open research, for-profit and for profit to forprofit, no-profit
corpus_df$txt <- gsub(" F.A.I.R. ", " FAIR ", corpus_df$txt, useBytes = TRUE)
corpus_df$txt <- gsub(" OA ", " open access ", corpus_df$txt, useBytes = TRUE)
corpus_df$txt <- gsub(" OR ", " open research ", corpus_df$txt, useBytes = TRUE)
corpus_df$txt <- gsub(" OS ", " open science ", corpus_df$txt, useBytes = TRUE)
corpus_df$txt <- gsub(" OA ", " open access ", corpus_df$txt, useBytes = TRUE)
corpus_df$txt <- gsub("no-profit|not-for-profit|not for-profit|no profit", "nonprofit", corpus_df$txt,useBytes = TRUE)
corpus_df$txt <- gsub("for-profit|for profit", "forprofit", corpus_df$txt,useBytes = TRUE)
corpus_df$txt <- gsub("DOIs|dois|DOI", "doi", corpus_df$txt, useBytes = TRUE)
# removing email addresses @
corpus_df$txt <- gsub("\\S*@\\S*","",corpus_df$txt, useBytes = TRUE)
# removing names mentioned in the documents:
corpus_df$txt <- gsub("Marc Schiltz the President of Science Europe|Dr. Francesca Dominici|Kaiser Wilhelm|Harold Varmus|Patrick Brown|Michael Eisen|Adolph von Harnack|Harnack|Otto Hahn Medal|Albert Einstein|Robert-Jan Smits|Carl Folke|Lance Gunderson|Abraham Lincoln|Sewall Wright|Ruth Patric|Douglas Futuyama|Louis Agassiz at Harvard's Museum of Comparative Zoology|Charles Darwin|Isaac Newton|Rosalind Franklin|Theodosius Dobzhansky","",corpus_df$txt, useBytes = TRUE)
# removing all names (part 1)
corpus_df$txt <- gsub("General Conference of the United Nations Educational, Scientific and Cultural Organization|International Association of Scientific, Technical & Medical Publishers|Coordination for the Improvement of Higher Education Personnel (CAPES)|Jasper Loftus-Hills Young Investigator Award|Edward O. Wilson Naturalist Award|International Network for the Availability of Scientific Publications|United Nations Educational, Scientific and Cultural Organization|Office of Polar Programs at the U.S. National Science Foundation|National Commission for Scientific and Technological Research|Coalition for Publishing Data in the Earth and Space Sciences|Natural Sciences and Engineering Research Council of Canada|Coordenacao de Aperfeicoamento de Pessoal de Nivel Superior|Catalogue of Australian Antarctic and Subantarctic Metadata|Open Reliable Transparent Ecology and Evolutionary biology|International Nucleotide Sequence Database Collaboration|United States Government's National Science Foundation|Proceedings of the Royal Society B Biological Sciences|National Charter of Ethics for the Research Profession|Consortium of African Funds for the Environment (CAFE)|Committee on Data of the International Science Council|South African National Biodiversity Institute (SANBI)|Scholarly Publishing and Academic Resources Coalition|Malawi Environmental Endowment Trust (MEET) in Malawi|National Council of Science and Technology (Conacyt)|Annual Review of Ecology Evolution and Systematics|the University of Chicago Press Journals Division|Philosophical Transactions of the Royal Society B|International Max Planck Research Schools (IMPRS)|the National Health and Medical Research Council|Australian Government’s Department of Innovation|Consortium of African Funds for the Environment|the National Competitive Grants Program (NCGP)|European Society of Evolutionary Biology|Research for Development and Innovation (ARDI)|National Institute of Standards and Technology|International Congress of Conservation Biology|French National Centre for Scientific Research|University of Chicago Press Journals Division|Study of Environmental Arctic Change (SEARCH)|South African National Biodiversity Institute|Reference Center on Environmental Information|Biological and Chemical Oceanography Sections|Open Access Envoy of the European Commission|National Natural Science Foundation of China|National Institutes of Health|Big Hairy Audacious Goal|Deutsche Zentren für Gesundheitsforschung|University of Colorado Boulder|Study of Environmental Arctic Change (SEARCH)|John Maynard Smith|Darwin Core|PeerJ – the Journal of Life & Environmental Sciences (PeerJ)|PeerJ Computer Science|PeerJ Physical Chemistry|PeerJ Organic Chemistry|PeerJ Inorganic Chemistry|PeerJ Analytical Chemistry and PeerJ Materials Science", "", corpus_df$txt, useBytes = TRUE)
# removing all names (part 2)
corpus_df$txt <- gsub("African Institute of Open Science & Hardware|Electronic Publishing Trust for Development|Remote Sensing in Ecology and Conservation|National Competitive Grants Program (NCGP)|Journal of Biogeography and Global Ecology|Excellence in Research for Australia (ERA)|Excellence in Research for Australia (ERA)|Intergovernmental Panel on Climate Change|Gottlieb Daimler and Karl Benz Foundation|Carl Benz House|European Society for Evolutionary Biology|Sea World Research and Rescue Foundation|Science for Nature and People Parnership|Global Biodiversity Information Facility|Frontiers in Ecology and the Environment|EMBL's European Bioinformatics Institute|Artificial Intelligence Review Assistant|Institute of Arctic and Alpine Research|State of Florida and Palm Beach County|Peer Community in Evolutionary Biology|European Group on Biological Invasions|Arctic, Antarctic, and Alpine Research|Weizmann Institute in Rehovot, Israel|UNESCO Universal Copyright Convention|UNESCO Recommendation on Open Science|International Panel on Climate Change|European Molecular Biology Laboratory|European Molecular Biology Laboratory|University of Toronto at Scarborough|Natural Environment Research Council|Knut and Alice Wallenberg Foundation|Global Open Science Hardware Roadmap|State of Alaska's Salmon and People|Research for Global Justice (GOALI)|National Natural Science Foundation|Knowledge Network for Biocomplexity|Society for the Study of Evolution|Research in the Environment (OARE)|Frontiers in Ecology and Evolution|Data Observation Network for Earth|Collaborative Peer Review Platform|the American Journal of Sociology|Spanish National Research Council|Research Ideas and Outcomes (RIO)|Research Ideas and Outcomes (RIO)|European Bioinformatics Institute|Directory of Open Access Journals|Cambridge Conservation Initiative|Alexander von Humboldt Foundation|the Zoological Society of London|Society for Conservation Biology|Open Educational Resources (OER)|Field Chief Editor Mark A. Elgar|Biogeosciences Discussions (BGD)|Australian Antarctic Data Centre|University of Toronto Libraries|The University of Chicago Press|Research in Agriculture (AGORA)|NIH Intramural Research Program|National Research Council|National Academy of Engineering|Millennium Ecosystem Assessment|Journal of Evolutionary Biology|Howard Hughes Medical Institute|German Climate Computing Centre|French National Research Agency|European Research Council (ERC)|eLife Sciences Publications Ltd|Ecological Society of Australia|Deutsche Forschungsgemeinschaft|American Society of Naturalists|Japan's Science and Technology|Australian Government Minister|Australasian Evolution Society|African Journals OnLine (AJOL)|Africa Open Science & Hardware|World Data Center for Climate|Trends in Ecology & Evolution|National Institutes of Health|Kurchatov Institute in Russia|International Science Council|Elsevier’s Clinical Solutions|Ecological Society of America|Department of Social Sciences|Cornell and Yale Universities|Cold Spring Harbor Laboratory|American Journal of Sociology|Research for Health (Hinari)|Philosophical Transactions B|Nature Ecology and Evolution|National Research Foundation|National Library of Medicine|National Academy of Sciences|National Academy of Medicine|Journal of Political Economy|Journal of Political Economy|Helmholtz-Alberta Initiative|Harvard Dataverse Repository|European Research Area (ERA)|ISI ScienceWatch|Royal Charter|Springer Nature|The Nature Portfolio|Scientific American", "", corpus_df$txt, useBytes = TRUE)
# removing all names (part 3)
corpus_df$txt <- gsub("University of Chicago Press|Tropical Database in Brazil|Research Ideas and Outcomes|National Science Foundation|Ministry of Education (MEC)|Federal Republic of Germany|Diversity and Distributions|Daimler and Benz Foundation|Chinese Academy of Sciences|Chinese Academy of Sciences|Australian Research Council|Australia’s Chief Scientist|Russian Academy of Science|Nature Ecology & Evolution|National Research Strategy|Max Planck Innovation GmbH|Journal of Applied Ecology|Further Max Planck Centers|British Ecological Society|WHO, FAO, UNEP, WIPO, ILO|Royal Society Te Aparangi|Peer Community in Ecology|National Research Council|Evolutionary Applications|European Research Council|Environmental Funds|EFs|Biodiversity Data Journal|Biodiversity Data Journal|Royal Society Publishing|Dryad Digital Repository|Digital Editorial Office|Data Distribution Centre|Comparative Cytogenetics|Comparative Cytogenetics|American Biology Teacher|University of Melbourne|Public Research Centers|International Data Week|Ecological Applications|Ecological Applications|Center for Open Science|Biological Conservation|African Journals OnLine|African Journals OnLine|Wellcome Genome Campus|Research Data Alliance|Kaiser Wilhelm Society|Helmholtz-Gemeinschaft|Deutscher Wetterdienst|BirdLife international|Swedish Energy Agency|Social Service Review|Senator Claude Pepper|Ministry of Education|Institute of Medicine|Helmholtz Association|Helmholtz Association|Global Change Biology|Ecology and Evolution|DNA Databank of Japan|Congress of the Union|Bioline International|Bioline|Australian Government|ARC Discovery Program|Research Data Canada|Conservation Letters|Conservation Biology|Brazilian Federation|Big Garden Birdwatch|Albatross Task Force|Resilience Alliance|Nature Conservation|Nature Conservation|Marine Data Archive|European Commission|European Commission|Environmental Funds|Environmental Funds|Ecology and Society|Clarivate Analytics|American Naturalist|Russian Federation|Publication Ethics|Max Planck Society|Max Planck Society|Give Nature a Home|Free Our Knowledge|Fraunhofer Society|Peer Community In|Harvard Dataverse|Evolution Letters|Ecology & Society|CSIR South Africa|Bertha Benz Prize|United Utilities|Carl Benz House|NRF South Africa|Nature Portfolio|Helmholtz Senate|Ecology Letters|Daimler-Benz AG|CSIRO Australia|Colorado alpine|BioOne Complete|BioOne|HAMAGUCHI Plan|Gray's Anatomy|Biogeosciences|Annual Reviews|ZSL Whipsnade|ScienceDirect|ScienceDirect|Royal Society|Research4Life|PCI Evol Biol|Mexican State|GCB Bioenergy|Cell Symposia|Bose-Einstein|Plos Biology|Humboldtians|Humboldt|Horizon 2020|Google Drive|Future Earth|Biogeography|WDC-Climate|the Academy|Kichstarter|Humboldtian|FOSTER Plus|FAIRsharing|ELIXIR Node|cOAlition S|ZSL London|SciDataCon|Max Planck|Figure 360|EcoEvoRxiv|Daimler AG|CU-Boulder|Cell Press|Africa OSH|Sea World|PhytoKeys|NRC Egypt|MOE China|Frontiers|Evolution|Elseviere|CiteScore|Wellcome|rOpenSci|PCI Ecol|OpenAIRE|CU-Boulder |Neobiota|NeoBiota|MycoKeys|HUPO PSI|Figshare|EMBL-EBI|Elsevier|DataCite|ZooKeys|RESTful|Redalyc|Pensoft|FORCE11|Figshare|figshare|Ecology|Dropbox|DataONE|Conacyt|COMBINE|bioRxiv|AmeliCA|Zenodo|Plan S|Lancet|Gitlab|GitLab|Git|FORMAS|CoData|CODATA|Wiley|PeerJ|Inter|eLife|Dryad|Coko|CNPq|Cell |Hinari|Pronaces|Cnr|Vinnova|Minerva|uGREAT|Benz|GitHub|protocols.io|Andrea Stephens|Mtauranga|Metacat|ELIXIR|VSNU and the UKB|Springer|Nikau Consultancy|Aspiration", "", corpus_df$txt, useBytes = TRUE)
# removing all names (part 4)
corpus_df$txt <- gsub("Washington Watch|BioScience|Eye on Education|AIBS Bulletin|Dr. Francesca Dominici|PeerJ – the Journal of Life & Environmental Sciences (PeerJ)|PeerJ Computer Science|PeerJ Physical Chemistry|PeerJ Organic Chemistry|PeerJ Inorganic Chemistry|PeerJ Analytical Chemistry and PeerJ Materials Science", "", corpus_df$txt, useBytes = TRUE)
# removing words related to the locations and names
corpus_df$txt <- gsub("Global South|Global North|New Zealanders|New Zelanders|New Zeland|New Zealand|Great Britain|North America|Eastern Europe|South America|South africans|South africa|Eastern Europe|ARPHA Platform|Woods Hole Oceanographic Institution|US JGOFS|US GLOBEC|NSF Geosciences Directorate (GEO) Division of Ocean Sciences (OCE) Biological and Chemical Oceanography Sections, Division of Polar Programs (PLR) Antarctic Sciences (ANT) Organisms & Ecosystems, and Arctic Sciences (ARC) awards|(DACST)|(CSD)|(FRD)|GBIF.org","",corpus_df$txt, useBytes = TRUE)
# removing abbreviations and other missed words
corpus_df$txt <- gsub("(CREDIT)|BCO-DMO|CONICYT|NEOBIOTA|INSTAAR|COPDESS|CLOCKSS|CoESRA|CAASM|AADC|CONZUL|EMPSEB|SHaRED|SORTEE|SEARCH|SANBI|SPARC|INSTAAR|UNESCO|APEC|AOASG|ARPHA|NCEAS|ICPSR|IMPRS|CMIP5|JDAP|CERN|MBMG|INASP|NSERC|GOALI|AIRA|AJOL|APIs|EMBL|AIBS|CAUL|CRIA|DOAJ|ICBB|ESEB|GBIF|K-12|NCBI|NCGP|NERC|IPCC|CNRS|CSIC|CSIR|BEIS|OARE|HSRC|PLOS|AAAR|USGS|NCAR|NOAA|NEON|ARDI|RSPB|DDBJ|INSDC|INSD|STAR|TERN|TREE|UTSC|UKRI|ARC|BES|SSE|COS|CAS|CTFs|DDI|EPT|ERC|ERA|JST|KNB|NRF|DFG|MDA|NIH|NLM|NRC|NRF|OSF|SCB|OSH|OAI|OCE|PCB|PCI|RDA|GCB|RDC|NSF|BGD|BMC|BHAG|ESA|ZSL|SPP|RCC|RMB|TRL|API|ARC|PLR|DDC|DKRZ|DWD|DVCS|NAE|NAM|EBI|ANR|API|NAS|ASN|NSF|OCE|ANT|UIs|API|EiC|TEE|UCL|SDGs|PIA|CL|RA|RS|STI|SNI|BG|U.K.|U.S.|EC|SC|CU|R&D|Eos|EIDs","",corpus_df$txt, useBytes = TRUE)
# removing numbers
corpus_df$txt <- gsub("[0-9]+","",corpus_df$txt, useBytes = TRUE)
# removing "'s"
corpus_df$txt <- gsub("'s","",corpus_df$txt, useBytes = TRUE)
# Replace [^a-zA-Z0-9 -] (any character that is not a letter, digit, space or hyphen) with an empty string.
corpus_df$txt <- gsub("[^a-zA-Z0-9 -]", "",corpus_df$txt, useBytes = TRUE)
Tokenising each statement’s sentences, then identifying and removing stop words
# Tokenisation - creating a tidy text: it converts tokens to lowercase and removes punctuation
# Starting with tokenizing text into sentences:
corpus_df$txt_copy <- corpus_df$txt
# library(stringi)
# corpus_df$txt_copy <- stri_enc_toutf8(corpus_df$txt)
data_tidy_sentences <- corpus_df %>%
unnest_tokens(sentence, txt_copy, token = "sentences")
data_tidy_sentences <- data_tidy_sentences %>% group_by(name) %>% mutate(sentence_id = row_number())
data_tidy_sentences$sentence_doc <- paste0(data_tidy_sentences$name, "_", data_tidy_sentences$sentence_id)
colnames(data_tidy_sentences)
[1] "txt" "filename" "name" "doc_type" "stakeholder"
[6] "sentence" "sentence_id" "sentence_doc"
data_tidy_sentences <- as.data.frame(data_tidy_sentences)
data_tidy <- data_tidy_sentences %>%
# mutate(as.character(sentence)) %>%
unnest_tokens(word, sentence, token = "words" ) %>%
select(-sentence_id)
# Removal of stop words: check the lexicons in stop_words and create a custom list of stop words, e.g. numbering (ii, iii, iv, v), document-type names (aim, aims, mission...), stakeholder names (erc, nerc, wellcome)
# The onix lexicon contains words like "open", "opened" and so on, so I decided to exclude it from the analysis
my_stop_words <- stop_words %>%
filter(!grepl("onix", lexicon))
# removing other words (names of stakeholders, types of documents, months, abbreviations and other uninformative words)
my_stop_words <- bind_rows(data_frame(word = c("e.g", "i.e", "ii", "iii", "iv", "v", "vi", "vii", "ix", "x", "", "missions", "mission", "aims", "aimed", "aim", "values", "value", "vision", "about", "publisher", "funder", "society", "journal", "repository", "deutsche", "january", "febuary", "march", "april", "may", "june", "july", "august", "september", "october", "november", "december", "jan", "feb", "mar", "apr", "jun", "jul", "aug", "sep", "sept", "oct", "nov", "dec", "australasian", "australians", "australian", "australia", "latin", "america", "cameroon", "yaoundé", "berlin", "baden", "london", "whipsnade", "san", "francisco", "britain", "european", "europe", "malawi", "sweden", "florida", "shanghai", "argentina", "india", "florida", "luxembourg", "italy", "canadians", "canadian", "canada", "spanish", "spain", "france", "french", "antarctica", "antarctic", "paris", "cambridge", "harvard", "russian", "russia", "chicago", "colorado", "africans", "african", "africa", "japan", "japanese", "brazil", "zelanders", "zeland", "mori", "aotearoa", "american", "america", "australasia", "hamburg", "netherlands", "berlin", "china", "chinese", "brazil", "mexico", "germany", "german", "ladenburg", "baden", "potsdam", "platz", "oxford", "berlin", "asia", "budapest", "taiwan", "chile", "putonghua", "hong", "kong","helmholtz", "bremen", "copenhagen", "stuttgart", "hinxton", "mātauranga", "māori", "yaound", "egypt", "uk", "usa", "eu", "st", "miraikan", "makao", "billion", "billions", "eight", "eighteen", "eighty", "eleven", "fifteen", "fifty", "five", "forty", "four", "fourteen", "hundreds", "million", "millions", "nine", "nineteen", "ninety", "one", "ones", "seven", "seventeen", "seventy", "six", "sixteen", "sixty", "ten", "tens", "thirteen", "thirty", "thousand", "thousands", "three", "twelve", "twenty", "two", "iccb", "ca"), lexicon = c("custom")), my_stop_words)
data_tidy <- data_tidy %>%
anti_join(my_stop_words)
Joining with `by = join_by(word)`
# lemmatizing using lemma table
token_words <- tokens(data_tidy$word, remove_punct = TRUE)
tw_out <- tokens_replace(token_words,
pattern = lexicon::hash_lemmas$token,
replacement = lexicon::hash_lemmas$lemma)
tw_out_df<- as.data.frame(unlist(tw_out))
data_tidy <- cbind(data_tidy, tw_out_df$"unlist(tw_out)")
colnames(data_tidy)[which(names(data_tidy) == "word")] <- "orig_word"
colnames(data_tidy)[which(names(data_tidy) == "tw_out_df$\"unlist(tw_out)\"")] <- "word_mix"
# changing American English to British English
ukus_out <- tokens(data_tidy$word_mix, remove_punct = TRUE)
ukus_out <- quanteda::tokens_lookup(ukus_out, data_dictionary_us2uk, exclusive = FALSE, capkeys = FALSE)
ukus_df <- as.data.frame(unlist(ukus_out))
data_tidy <- cbind(data_tidy, ukus_df$"unlist(ukus_out)")
colnames(data_tidy)[which(names(data_tidy) == "ukus_df$\"unlist(ukus_out)\"")] <- "word"
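As a quick illustration of this lookup (a toy example, not part of the pipeline), data_dictionary_us2uk replaces American spellings with their British equivalents:
# Toy example of the same lookup; "color"/"behavior" should map to "colour"/"behaviour"
toy <- tokens("color and behavior")
quanteda::tokens_lookup(toy, data_dictionary_us2uk, exclusive = FALSE, capkeys = FALSE)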
Creating a column with subgroup information: OA vs non-OA journals, and for-profit vs not-for-profit publishers
data_words <- data_tidy
# Creating a column that will include info about OA and nonOA journals or publisher for profit and non-profit
data_words$org_subgroups <- data_words$stakeholder
data_words$stakeholder[data_words$stakeholder%in% c("journals_OA", "journals_nonOA" )] <- "journals"
data_words$stakeholder[data_words$stakeholder%in% c("publishers_Profit", "publishers_nonProfit" )] <- "publishers"
Information and tables with the number of documents per stakeholder group and the source websites of the statements
# Number of documents per stakeholder
number_of_documents <- data_tidy %>%
select(name, stakeholder) %>%
distinct(name, .keep_all = TRUE) %>%
group_by(stakeholder) %>%
count(stakeholder)
# Table with a number of documents per stakeholder group
number_of_documents %>%
kbl(caption = "Number of documents per stakeholder group") %>%
kable_classic("hover", full_width = F)
stakeholder | n |
---|---|
advocates | 24 |
funders | 30 |
journals_OA | 14 |
journals_nonOA | 16 |
publishers_Profit | 6 |
publishers_nonProfit | 9 |
repositories | 17 |
societies | 13 |
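As a quick sanity check (a small addition, not in the original script), the per-group counts above sum to the 129 statements reported at the start of this page:
# The group counts should total the 129 collected statements
sum(number_of_documents$n)  # 24+30+14+16+6+9+17+13 = 129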
# Creating a table with the source links of the statements
info <- corpus_df_website_info %>%
select(txt, filename, name, stakeholder)
info$stakeholder_more <- info$stakeholder
info$stakeholder[info$stakeholder%in% c("journals_OA", "journals_nonOA" )] <- "journals"
info$stakeholder[info$stakeholder%in% c("publishers_Profit", "publishers_nonProfit" )] <- "publishers"
# source links of the websites
source_website <- info$website <- word(info$txt, 1)
website_info_table <- info %>%
select(stakeholder, website)
website_info_table %>%
kbl(caption = "Source websites of the statements") %>%
kable_paper("hover", full_width = F)
# This data will be used in 2_Topic_Modeling, 4_Language_analysis
write_csv(data_words, "./output/created_datasets/cleaned_data.csv")
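For reference, the downstream analyses can reload this dataset with readr (a minimal sketch; the exact loading code lives in those scripts):
# A minimal sketch (not from the original scripts): downstream analyses such as
# 2_Topic_Modeling and 4_Language_analysis can reload the dataset with readr
data_words <- read_csv("./output/created_datasets/cleaned_data.csv")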
sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: x86_64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.6
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: Europe/London
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] kableExtra_1.3.4 stm_1.3.6.1
[3] ggraph_2.1.0 igraph_1.5.1
[5] reshape2_1.4.4 wordcloud_2.6
[7] RColorBrewer_1.1-3 topicmodels_0.2-14
[9] tm_0.7-11 NLP_0.2-1
[11] quanteda.dictionaries_0.4 quanteda.textplots_0.94.3
[13] quanteda_3.3.1 tidytext_0.4.1
[15] lubridate_1.9.3 forcats_1.0.0
[17] stringr_1.5.0 dplyr_1.1.3
[19] purrr_1.0.2 readr_2.1.4
[21] tidyr_1.3.0 tibble_3.2.1
[23] ggplot2_3.4.3 tidyverse_2.0.0
[25] workflowr_1.7.1
loaded via a namespace (and not attached):
[1] gridExtra_2.3 rlang_1.1.1 magrittr_2.0.3 git2r_0.32.0
[5] compiler_4.3.1 getPass_0.2-2 systemfonts_1.0.4 callr_3.7.3
[9] vctrs_0.6.3 rvest_1.0.3 pkgconfig_2.0.3 crayon_1.5.2
[13] fastmap_1.1.1 utf8_1.2.3 promises_1.2.1 rmarkdown_2.25
[17] tzdb_0.4.0 ps_1.7.5 bit_4.0.5 xfun_0.40
[21] modeltools_0.2-23 cachem_1.0.8 jsonlite_1.8.7 highr_0.10
[25] SnowballC_0.7.1 later_1.3.1 tweenr_2.0.2 syuzhet_1.0.7
[29] parallel_4.3.1 stopwords_2.3 R6_2.5.1 bslib_0.5.1
[33] stringi_1.7.12 jquerylib_0.1.4 Rcpp_1.0.11 knitr_1.44
[37] httpuv_1.6.11 Matrix_1.5-4.1 timechange_0.2.0 tidyselect_1.2.0
[41] rstudioapi_0.15.0 yaml_2.3.7 viridis_0.6.4 processx_3.8.2
[45] lattice_0.21-8 plyr_1.8.9 withr_2.5.1 evaluate_0.21
[49] RcppParallel_5.1.7 polyclip_1.10-6 xml2_1.3.5 pillar_1.9.0
[53] lexicon_1.2.1 janeaustenr_1.0.0 whisker_0.4.1 stats4_4.3.1
[57] generics_0.1.3 vroom_1.6.3 rprojroot_2.0.3 hms_1.1.3
[61] munsell_0.5.0 scales_1.2.1 glue_1.6.2 slam_0.1-50
[65] tools_4.3.1 data.table_1.14.8 tokenizers_0.3.0 webshot_0.5.5
[69] fs_1.6.3 graphlayouts_1.0.2 fastmatch_1.1-4 tidygraph_1.2.3
[73] grid_4.3.1 colorspace_2.1-0 ggforce_0.4.1 cli_3.6.1
[77] fansi_1.0.4 viridisLite_0.4.2 svglite_2.1.2 gtable_0.3.4
[81] sass_0.4.7 digest_0.6.33 ggrepel_0.9.4 farver_2.1.1
[85] htmltools_0.5.6 lifecycle_1.0.3 httr_1.4.7 bit64_4.0.5
[89] MASS_7.3-60