changeset 0:69714f06f18b draft default tip

"planemo upload for repository https://github.com/galaxyproject/tools-iuc/tools/simtext commit 63a5e13cf89cdd209d20749c582ec5b8dde4e208"
author iuc
date Wed, 24 Mar 2021 08:33:56 +0000
parents
children
files README.md abstracts_by_pmids.R macros.xml pmids_to_pubtator_matrix.R pmids_to_pubtator_matrix.xml pubmed_by_queries.R test-data/abstracts_by_pmids_output test-data/pmids_to_pubtator_matrix_output test-data/pmids_to_pubtator_matrix_output_byid test-data/pmids_to_pubtator_matrix_output_number test-data/pubmed_by_queries_output test-data/pubmed_by_queries_output_abstracts test-data/test_data test-data/text_to_wordmatrix_output test-data/text_to_wordmatrix_output_args test/commands_tests text_to_wordmatrix.R
diffstat 17 files changed, 1155 insertions(+), 0 deletions(-)
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/README.md	Wed Mar 24 08:33:56 2021 +0000
@@ -0,0 +1,198 @@
+# SimText
+
+A text mining framework for interactive analysis and visualization of similarities among biomedical entities.
+
+## Brief overview of tools:
+
+ - pubmed_by_queries: 
+
+ For each search query, PMIDs or abstracts from PubMed are saved.
+
+ - abstracts_by_pmids: 
+
+ For all PMIDs in each row of a table, the corresponding abstracts are saved in additional columns.
+
+ - text_to_wordmatrix: 
+
+ The most frequent words of text from each row are extracted and united in one large binary matrix. 
+ 
+ - pmids_to_pubtator_matrix: 
+
+ For PMIDs of each row, scientific words are extracted using PubTator annotations and subsequently united in one large binary matrix. 
+
+ - simtext_app: 
+
+ Shiny app with word clouds, dimension reduction plot, dendrogram of hierarchical clustering and table with words and their frequency among the search queries.
+
+## Set up user credentials on Galaxy
+
+To enable users to set their credentials (NCBI API Key) for this tool,
+make sure the file `config/user_preferences_extra_conf.yml` has the following section:
+
+```
+preferences:
+    ncbi_account:
+        description: NCBI account information
+        inputs:
+            - name: apikey
+              label: NCBI API Key (available from "API Key Management" at https://www.ncbi.nlm.nih.gov/account/settings/)
+              type: text
+              required: False
+
+```
+
+## Requirements command-line version
+
+ - R (version > 4.0.0)
+
+## Installation command-line version
+
+```
+$ mkdir -p <path>/simtext
+$ cd <path>/simtext
+$ git clone https://github.com/dlal-group/simtext
+```
+
+## pubmed_by_queries
+
+This tool uses a set of search queries to download a defined number of abstracts or PMIDs for each search query from PubMed. PubMed's search rules and syntax apply. Users can obtain an API key from the Settings page of their NCBI account (to create an account, visit http://www.ncbi.nlm.nih.gov/account/). If the tool is used as a command-line tool, the API key is passed as an argument. For usage in Galaxy, the API key is added to the Galaxy user preferences (User/ Preferences/ Manage Information).
+
+Input:
+
+Tab-delimited table with a list of search queries (biomedical entities of interest) in one column. The column header should start with "ID_" (e.g., "ID_gene" if search queries are genes). 
+
+Usage:
+```
+$ Rscript pubmed_by_queries.R [-h] [-i INPUT] [-o OUTPUT] [-n NUMBER] [-a] [-k KEY] [--install_packages]
+```
+
+Optional arguments: 
+```
+ -h, --help                  show help message
+ -i INPUT, --input INPUT     input file name. add path if file is not in working directory
+ -o OUTPUT, --output OUTPUT  output file name [default "pubmed_by_queries_output"]
+ -n NUMBER, --number NUMBER  number of PMIDs or abstracts to save per ID [default "5"]
+ -a, --abstract              if abstracts instead of PMIDs should be retrieved, use --abstract
+ -k KEY, --key KEY           if NCBI API key is available, add it to speed up the download of PubMed data. For usage in Galaxy add the API key to the Galaxy user-preferences (User/ Preferences/ Manage Information).
+ --install_packages          if you want to auto install missing required packages
+```
+
+Output: 
+
+A table with additional columns containing PMIDs or abstracts from PubMed.
+
+## abstracts_by_pmids
+
+This tool retrieves abstracts for a matrix of PMIDs. The abstract text is saved in additional columns.
+
+Input:
+
+Tab-delimited table with rows representing biomedical entities and columns containing the corresponding PMIDs. The names of the PMID columns should start with “PMID_” (e.g., “PMID_1”, “PMID_2” etc.).
+
+Usage:
+```
+$ Rscript abstracts_by_pmids.R [-h] [-i INPUT] [-o OUTPUT] [--install_packages]
+```
+
+Optional arguments: 
+```
+ -h, --help                 show help message
+ -i INPUT, --input INPUT    input file name. add path if file is not in working directory
+ -o OUTPUT, --output OUTPUT output file name [default "abstracts_by_pmids_output"]
+ --install_packages         if you want to auto install missing required packages
+```
+
+Output: 
+
+A table with additional columns containing abstract texts.
+
+## text_to_wordmatrix
+
+The tool extracts, for each row, the most frequent words from the text in columns starting with "ABSTRACT" or "TEXT". The extracted words from each row are united in one large binary matrix, with 0 = word not frequently occurring in the text of that row and 1 = word frequently present in the text of that row.
+
+Input: 
+
+The output of ‘pubmed_by_queries’ or ‘abstracts_by_pmids’ tools, or a tab-delimited table with text in columns starting with "ABSTRACT" or "TEXT".
+
+Usage:
+```
+$ Rscript text_to_wordmatrix.R [-h] [-i INPUT] [-o OUTPUT] [-n NUMBER] [-r] [-l] [-w] [-s] [-p]
+```
+
+Optional arguments: 
+```
+ -h, --help                    show help message
+ -i INPUT, --input INPUT       input file name. add path if file is not in working directory
+ -o OUTPUT, --output OUTPUT    output file name. [default "text_to_wordmatrix_output"]
+ -n NUMBER, --number NUMBER    number of most frequent words that should be extracted per row [default "50"]
+ -r, --remove_num              remove any numbers in text
+ -l, --lower_case              by default all characters are translated to lower case. otherwise use -l
+ -w, --remove_stopwords        by default a set of english stopwords (e.g., 'the' or 'not') are removed. otherwise use -w
+ -s, --stemDoc                 apply Porter's stemming algorithm: collapsing words to a common root to aid comparison of vocabulary
+ -p, --plurals                 by default words in plural and singular are merged to the singular form. otherwise use -p
+ --install_packages            if you want to auto install missing required packages
+```
+
+Output: 
+
+A binary matrix in which each column represents one of the extracted words.
+
+## pmids_to_pubtator_matrix
+
+The tool uses all PMIDs per row and extracts "Gene", "Disease", "Mutation", "Chemical" and "Species" terms of the corresponding abstracts, using PubTator annotations. The user can choose from which categories terms should be extracted. The extracted terms are united in one large binary matrix, with 0= term not present in abstracts of that row and 1= term present in abstracts of that row. The user can decide if the scientific terms should be extracted and used as they are or if they should be grouped by their geneIDs/ meshIDs (several terms are often grouped into one ID). Also, by default all terms are extracted, otherwise the user can specify a number of most frequent words to extract per row.
+
+Input: 
+
+Output of 'abstracts_by_pmids' tool, or tab-delimited table with columns containing PMIDs. The names of the PMID columns should start with "PMID", e.g. "PMID_1", "PMID_2" etc.
+
+Usage:
+```
+$ Rscript pmids_to_pubtator_matrix.R [-h] [-i INPUT] [-o OUTPUT] [-b] [-n NUMBER] [-c {Gene,Disease,Mutation,Chemical,Species} [{Gene,Disease,Mutation,Chemical,Species} ...]]
+```
+ 
+Optional arguments:
+```
+ -h, --help                    show help message
+ -i INPUT, --input INPUT       input file name. add path if file is not in working directory
+ -o OUTPUT, --output OUTPUT    output file name. [default "pmids_to_pubtator_matrix_output"]
+ -b, --byid                    if you want to find common gene IDs / mesh IDs instead of specific scientific terms.
+ -n NUMBER, --number NUMBER    number of most frequent terms/IDs to extract. by default all terms/IDs are extracted.
+ -c [...], --categories [...]  PubTator categories that should be considered [default "('Gene', 'Disease', 'Mutation','Chemical')"]
+ --install_packages            if you want to auto install missing required packages
+```
+
+Output: 
+
+Binary matrix in which each column represents one of the extracted terms.
+
+## simtext_app
+
+The tool enables the exploration of data generated by ‘text_to_wordmatrix’ or ‘pmids_to_pubtator_matrix’ tools in a Shiny local instance. The following features can be generated: 1) word clouds for each initial search query, 2) dimension reduction and hierarchical clustering of binary matrices, and 3) tables with words and their frequency in the search queries.
+
+Input:
+
+1)	Input 1: 
+Tab-delimited table with
+	- A column with initial search queries starting with "ID_" (e.g., "ID_gene" if initial search queries were genes).
+	- Column(s) with grouping factor(s) to compare pre-existing categories of the initial search queries with the grouping based on text. The column names should start with "GROUPING_". If the column name is "GROUPING_disorder", "disorder" will be shown as a grouping variable in the app.
+2)	Input 2: 
+The output of ‘text_to_wordmatrix’ or ‘pmids_to_pubtator_matrix’ tools, or a binary matrix.
+
+Usage:
+```
+$ Rscript simtext_app.R [-h] [-i INPUT] [-m MATRIX] [-p PORT]
+```
+
+Optional arguments:
+```
+ -h,        --help             show help message
+ -i INPUT,  --input INPUT      input file name. add path if file is not in working directory
+ -m MATRIX, --matrix MATRIX    matrix file name. add path if file is not in working directory
+ -p PORT,   --port PORT        specify port, otherwise randomly selected
+ --host                        specify host
+ --install_packages            if you want to auto install missing required packages
+```
+
+Output: 
+
+SimText app
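
The binary matrix that the README describes for text_to_wordmatrix can be illustrated with a minimal sketch in base R. This is not the code of text_to_wordmatrix.R (which is not shown here); the toy texts, the variable names and the helper `top_words` are assumptions made for illustration, and stopword removal, stemming and plural merging are left out.

```r
# Hypothetical toy input: one text per row (stand-in for ABSTRACT_/TEXT_ columns).
texts <- c(
  "epilepsy epilepsy epilepsy gene gene channel",
  "channel channel channel sodium sodium gene"
)
top_n <- 2  # assumed number of most frequent words kept per row

# Return the n most frequent words of one text (lower-cased, split on non-letters).
top_words <- function(text, n) {
  words <- unlist(strsplit(tolower(text), "[^a-z]+"))
  words <- words[nzchar(words)]
  freq <- sort(table(words), decreasing = TRUE)
  names(freq)[seq_len(min(n, length(freq)))]
}

words_per_row <- lapply(texts, top_words, n = top_n)

# The union of all per-row words defines the columns of the matrix.
vocabulary <- sort(unique(unlist(words_per_row)))

# 1 = word is among the frequent words of that row, 0 = it is not.
word_matrix <- t(sapply(words_per_row, function(w) as.integer(vocabulary %in% w)))
colnames(word_matrix) <- vocabulary
print(word_matrix)
```

Each row keeps only its own top words, so a word that is frequent in one row but rare in another gets 1 in the first and 0 in the second.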
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/abstracts_by_pmids.R	Wed Mar 24 08:33:56 2021 +0000
@@ -0,0 +1,142 @@
+#!/usr/bin/env Rscript
+#TOOL2 abstracts_by_pmids
+#
+#This tool retrieves the corresponding abstracts for all PMIDs in each row of a table and saves them in additional columns.
+#
+#Input: Tab-delimited table with columns containing PMIDs. The names of the PMID columns should start with “PMID”, e.g. “PMID_1”, “PMID_2” etc.
+#
+#Output: Input table with additional columns containing abstracts corresponding to the PMIDs from PubMed.
+#The abstract columns are called "ABSTRACT_1", "ABSTRACT_2" etc.
+#
+# Usage: $ abstracts_by_pmids.R [-h] [-i INPUT] [-o OUTPUT]
+#
+# optional arguments:
+# -h, --help                 show help message
+# -i INPUT, --input INPUT    input file name. add path if file is not in working directory
+# -o OUTPUT, --output OUTPUT output file name. [default "abstracts_by_pmids_output"]
+
+
+if ("--install_packages" %in% commandArgs()) {
+  print("Installing packages")
+  if (!require("argparse")) install.packages("argparse", repo = "http://cran.rstudio.com/");
+  if (!require("reutils")) install.packages("reutils", repo = "http://cran.rstudio.com/");
+  if (!require("easyPubMed")) install.packages("easyPubMed", repo = "http://cran.rstudio.com/");
+  if (!require("textclean")) install.packages("textclean", repo = "http://cran.rstudio.com/");
+}
+
+suppressPackageStartupMessages(library("argparse"))
+library("reutils")
+suppressPackageStartupMessages(library("easyPubMed"))
+suppressPackageStartupMessages(library("textclean"))
+
+parser <- ArgumentParser()
+parser$add_argument("-i", "--input",
+                    help = "input file name. add path if file is not in working directory")
+parser$add_argument("-o", "--output", default = "abstracts_by_pmids_output",
+                    help = "output file name. [default \"%(default)s\"]")
+parser$add_argument("--install_packages", action = "store_true", default = FALSE,
+                    help = "If you want to auto install missing required packages.")
+
+args <- parser$parse_args()
+
+data <- read.delim(args$input, stringsAsFactors = FALSE, header = TRUE, sep = "\t")
+pmids_cols_index <- grep("PMID", names(data))
+
+fetch_abstracts <- function(pmids, row) {
+
+  efetch_result <- NULL
+  try_num <- 1
+  t_0 <- Sys.time()
+
+  while (is.null(efetch_result)) {
+
+    # Timing check: kill at 3 min
+    if (try_num > 1) {
+      Sys.sleep(time = 1 * try_num)
+      cat("Problem receiving PubMed data, or an error was returned. Please wait. Try number: ", try_num, "\n")
+    }
+
+    t_1 <- Sys.time()
+
+    if (as.numeric(difftime(t_1, t_0, units = "mins")) > 3) {
+      message("Killing the request! Something is not working. Please, try again later", "\n")
+      return(data)
+    }
+
+    efetch_result <- tryCatch({
+      suppressWarnings(efetch(uid = pmids, db = "pubmed", retmode = "xml"))
+    }, error = function(e) {
+      NULL
+    })
+
+    if (!is.null(as.list(efetch_result$errors)$error)) {
+      if (as.list(efetch_result$errors)$error == "HTTP error: Status 400; Bad Request") {
+        efetch_result <- NULL
+      }
+    }
+
+    try_num <- try_num + 1
+
+  } #while loop end
+
+  # articles to list
+  xml_data <- strsplit(efetch_result$content, "<PubmedArticle(>|[[:space:]]+?.*>)")[[1]][-1]
+  xml_data <- sapply(xml_data, function(x) {
+    #trim extra stuff at the end of the record
+    if (!grepl("</PubmedArticle>$", x))
+      x <- sub("(^.*</PubmedArticle>).*$", "\\1", x)
+    # Rebuild XML structure and proceed
+    x <- paste("<PubmedArticle>", x)
+    gsub("[[:space:]]{2,}", " ", x)},
+    USE.NAMES = FALSE, simplify = TRUE)
+
+  abstract_text <- sapply(xml_data, function(x) {
+    custom_grep(x, tag = "AbstractText", format = "char")},
+    USE.NAMES = FALSE, simplify = TRUE)
+
+  abstracts <- sapply(abstract_text, function(x) {
+    if (length(x) > 1) {
+      x <- paste(x, collapse = " ", sep = " ")
+      x <- gsub("</{0,1}i>", "", x, ignore.case = T)
+      x <- gsub("</{0,1}b>", "", x, ignore.case = T)
+      x <- gsub("</{0,1}sub>", "", x, ignore.case = T)
+      x <- gsub("</{0,1}exp>", "", x, ignore.case = T)
+    } else if (length(x) < 1) {
+      x <- NA
+    } else {
+      x <- gsub("</{0,1}i>", "", x, ignore.case = T)
+      x <- gsub("</{0,1}b>", "", x, ignore.case = T)
+      x <- gsub("</{0,1}sub>", "", x, ignore.case = T)
+      x <- gsub("</{0,1}exp>", "", x, ignore.case = T)
+    }
+    x
+  },
+  USE.NAMES = FALSE, simplify = TRUE)
+
+  abstracts <- as.character(abstracts)
+
+  if (length(abstracts) > 0) {
+    data[row, sapply(seq(length(abstracts)), function(i) {
+      paste0("ABSTRACT_", i)
+      })] <- abstracts
+    cat(length(abstracts), " abstracts for PMIDs of row ", row, " are added in the table.", "\n")
+  }
+
+  return(data)
+}
+    
+
+for (row in seq(nrow(data))) {
+  pmids <-  as.character(unique(data[row, pmids_cols_index]))
+  pmids <- pmids[!pmids == "NA"]
+
+  if (length(pmids) > 0) {
+    data <- tryCatch(fetch_abstracts(pmids, row),
+                    error = function(e) {
+                      Sys.sleep(3); data  # on failure, keep the table unchanged instead of assigning NULL
+                      })
+  } else {
+    print(paste("No PMIDs in row", row))
+  }
+}
+write.table(data, args$output, sep = "\t", row.names = FALSE, col.names = TRUE, quote = FALSE)
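
The retry logic in `fetch_abstracts` (wait longer after each failed attempt, give up after a fixed time) can be sketched in isolation. The helper below is a simplified illustration and not part of the changeset: `retry_fetch`, `fetch`, `max_secs` and `base_sleep` are hypothetical names, and `flaky` is a stand-in for the network call so the sketch runs offline.

```r
# Retry `fetch` until it returns a non-NULL value or `max_secs` have elapsed.
# All names here are hypothetical and chosen for this sketch only.
retry_fetch <- function(fetch, max_secs = 180, base_sleep = 1) {
  t_0 <- Sys.time()
  try_num <- 1
  repeat {
    if (try_num > 1) {
      Sys.sleep(base_sleep * try_num)  # wait longer after each failure
    }
    if (as.numeric(difftime(Sys.time(), t_0, units = "secs")) > max_secs) {
      return(NULL)  # give up; the caller decides how to handle the failure
    }
    # Errors are turned into NULL so that the loop simply tries again.
    result <- tryCatch(fetch(), error = function(e) NULL)
    if (!is.null(result)) {
      return(result)
    }
    try_num <- try_num + 1
  }
}

# Stand-in for a network call: fails twice, then succeeds.
calls <- 0
flaky <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("temporary failure")
  "ok"
}

result <- retry_fetch(flaky, max_secs = 5, base_sleep = 0)
print(result)
```

Returning `NULL` on timeout, rather than a partially filled table, leaves it to the caller to keep its existing data when a request cannot be completed.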
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/macros.xml	Wed Mar 24 08:33:56 2021 +0000
@@ -0,0 +1,11 @@
+<macros>
+    <token name="@VERSION@">0.0.2</token>
+
+    <xml name="citations">
+        <citations>
+            <citation type="doi">10.1101/2020.07.06.190629</citation>
+        </citations>
+    </xml>
+
+</macros>
+
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/pmids_to_pubtator_matrix.R	Wed Mar 24 08:33:56 2021 +0000
@@ -0,0 +1,231 @@
+#!/usr/bin/env Rscript
+#tool: pmids_to_pubtator_matrix
+#
+#The tool uses all PMIDs per row and extracts "Gene", "Disease", "Mutation", "Chemical" and "Species" terms of the
+#corresponding abstracts, using PubTator annotations. The user can choose from which categories terms should be extracted.
+#The extracted terms are united in one large binary matrix, with 0= term not present in abstracts of that row and 1= term
+#present in abstracts of that row. The user can decide if the scientific terms should be extracted and used as
+#they are or if they should be grouped by their geneIDs/ meshIDs (several terms can often be grouped into one ID).
+#Also, by default all terms are extracted; otherwise the user can specify a number of most frequent words to be extracted per row.
+#
+#Input: Output of abstracts_by_pmids or tab-delimited table with columns containing PMIDs.
+#The names of the PMID columns should start with "PMID", e.g. "PMID_1", "PMID_2" etc.
+#
+#Output: Binary matrix in that each column represents one of the extracted terms.
+#
+# usage: $ pmids_to_pubtator_matrix.R [-h] [-i INPUT] [-o OUTPUT] [-b] [-n NUMBER]
+# [-c {Gene,Disease,Mutation,Chemical,Species} [{Gene,Disease,Mutation,Chemical,Species} ...]]
+#
+# optional arguments:
+#   -h, --help                 show help message
+#   -i INPUT, --input INPUT    input file name. add path if file is not in working directory
+#   -n NUMBER, --number NUMBER Number of most frequent terms/IDs to extract. By default all terms/IDs are extracted.
+#   -o OUTPUT, --output OUTPUT output file name. [default "pmids_to_pubtator_matrix_output"]
+#   -c {Gene,Disease,Mutation,Chemical,Species} [{Gene,Disease,Mutation,Chemical,Species} ...], --categories {Gene,Disease,Mutation,Chemical,Species} [{Gene,Disease,Mutation,Chemical,Species} ...]
+#      Pubtator categories that should be considered.  [default "('Gene', 'Disease', 'Mutation','Chemical')"]
+
+if ("--install_packages" %in% commandArgs()) {
+  print("Installing packages")
+  if (!require("argparse")) install.packages("argparse", repo = "http://cran.rstudio.com/");
+  if (!require("stringr")) install.packages("stringr", repo = "http://cran.rstudio.com/");
+  if (!require("RCurl")) install.packages("RCurl", repo = "http://cran.rstudio.com/");
+  if (!require("stringi")) install.packages("stringi", repo = "http://cran.rstudio.com/");
+}
+
+suppressPackageStartupMessages(library("argparse"))
+library("stringr")
+library("RCurl")
+library("stringi")
+
+parser <- ArgumentParser()
+
+parser$add_argument("-i", "--input",
+                    help = "input file name. add path if file is not in working directory")
+parser$add_argument("-o", "--output", default = "pmids_to_pubtator_matrix_output",
+                    help = "output file name. [default \"%(default)s\"]")
+parser$add_argument("-c", "--categories", choices = c("Gene", "Disease", "Mutation", "Chemical", "Species"), nargs = "+",
+                    default = c("Gene", "Disease", "Mutation", "Chemical"),
+                    help = "Pubtator categories that should be considered. [default \"%(default)s\"]")
+parser$add_argument("-b", "--byid", action = "store_true", default = FALSE,
+                    help = "If you want to find common gene IDs / mesh IDs instead of scientific terms.")
+parser$add_argument("-n", "--number", default = NULL, type = "integer",
+                    help = "Number of most frequent terms/IDs to extract. By default all terms/IDs are extracted.")
+parser$add_argument("--install_packages", action = "store_true", default = FALSE,
+                    help = "If you want to auto install missing required packages.")
+
+args <- parser$parse_args()
+
+
+data <- read.delim(args$input, stringsAsFactors = FALSE, header = TRUE, sep = "\t")
+
+pmid_cols_index <- grep(c("PMID"), names(data))
+word_matrix <- data.frame()
+dict_table <- data.frame()
+pmids_count <- 0
+pubtator_max_ids <- 100
+
+
+merge_pubtator_table <- function(out_data, table) {
+  out_data <- unlist(strsplit(out_data, "\n", fixed = T))
+  for (i in 3:length(out_data)) {
+    temps <- unlist(strsplit(out_data[i], "\t", fixed = T))
+    if (length(temps) == 5) {
+      temps <- c(temps, NA)
+    }
+    if (length(temps) == 6) {
+      table <- rbind(table, temps)
+    }
+  }
+  return(table)
+}
+
+
+get_pubtator_terms <- function(pmids) {
+  table <- NULL
+  for (pmid_split in split(pmids, ceiling(seq_along(pmids) / pubtator_max_ids))) {
+    out_data <- NULL
+    try_num <- 1
+    t_0 <- Sys.time()
+    while (TRUE) {
+      # Timing check: kill at 3 min
+      if (try_num > 1) {
+        cat("Connection problem. Please wait. Try number:", try_num, "\n")
+        Sys.sleep(time = 2 * try_num)
+      }
+      try_num <- try_num + 1
+      t_1 <- Sys.time()
+      if (as.numeric(difftime(t_1, t_0, units = "mins")) > 3) {
+        message("Killing the request! Something is not working. Please, try again later", "\n")
+        return(table)
+      }
+      out_data <- tryCatch({
+        getURL(paste("https://www.ncbi.nlm.nih.gov/research/pubtator-api/publications/export/pubtator?pmids=",
+                     paste(pmid_split, collapse = ","), sep = ""))
+      }, error = function(e) {
+        print(e)
+        NULL  # a handler function cannot call next; returning NULL makes the while loop retry
+      }, finally = {
+        Sys.sleep(0)
+      })
+      if (!is.null(out_data)) {
+        table <- merge_pubtator_table(out_data, table)
+        break
+      }
+    }
+  }
+  return(table)
+}
+
+extract_category_terms <- function(table, categories) {
+  index_categories <- c()
+  categories <- as.character(unlist(categories))
+  if (!is.null(table) && ncol(table) == 6) {
+    for (i in categories) {
+      tmp_index <- grep(TRUE, i == as.character(table[, 5]))
+      if (length(tmp_index) > 0) {
+        index_categories <- c(index_categories, tmp_index)
+      }
+    }
+    table <- as.data.frame(table, stringsAsFactors = FALSE)
+    table <- table[index_categories, c(4, 6)]
+    table <- table[!is.na(table[, 2]), ]
+    table <- table[!(table[, 2] == "NA"), ]
+    table <- table[!(table[, 1] == "NA"), ]
+  }else{
+    return(NULL)
+  }
+}
+
+extract_frequent_ids_or_terms <- function(table) {
+  if (is.null(table)) {
+    return(NULL)
+    break
+  }
+  if (args$byid) {
+    if (!is.null(args$number)) {
+      #retrieve top X mesh_ids
+      table_mesh <- as.data.frame(table(table[, 2]))
+      colnames(table_mesh)[1] <- "mesh_id"
+      table_mesh <- table_mesh[order(table_mesh$Freq, decreasing = TRUE), ]
+      table_mesh <- table_mesh[1:min(args$number, nrow(table_mesh)), ]
+      table_mesh$mesh_id <- as.character(table_mesh$mesh_id)
+      #subset table for top X mesh_ids
+      table <- table[which(as.character(table$V6) %in% as.character(table_mesh$mesh_id)), ]
+      table <- table[!duplicated(table[, 2]), ]
+    } else {
+      table <- table[!duplicated(table[, 2]), ]
+    }
+  } else {
+    if (!is.null(args$number)) {
+      table[, 1] <- tolower(as.character(table[, 1]))
+      table <- as.data.frame(table(table[, 1]))
+      colnames(table)[1] <- "term"
+      table <- table[order(table$Freq, decreasing = TRUE), ]
+      table <- table[1:min(args$number, nrow(table)), ]
+      table$term <- as.character(table$term)
+    } else {
+      table[, 1] <- tolower(as.character(table[, 1]))
+      table <- table[!duplicated(table[, 1]), ]
+    }
+  }
+  return(table)
+}
+
+
+#for all PMIDs of a row get PubTator terms and add them to the matrix
+for (i in seq(nrow(data))) {
+  pmids <- as.character(data[i, pmid_cols_index])
+  pmids <- pmids[!pmids == "NA"]
+  if (pmids_count > 10000) {
+    cat("Break (10s) to avoid killing of requests. Please wait.", "\n")
+    Sys.sleep(10)
+    pmids_count <- 0
+  }
+  pmids_count <- pmids_count + length(pmids)
+  #get PubTator terms and process them with functions
+  if (length(pmids) > 0) {
+    table <- get_pubtator_terms(pmids)
+    table <- extract_category_terms(table, args$categories)
+    table <- extract_frequent_ids_or_terms(table)
+    if (!is.null(table)) {
+      colnames(table) <- c("term", "mesh_id")
+      # add data in binary matrix
+      if (args$byid) {
+        mesh_ids <- as.character(table$mesh_id)
+        if (length(mesh_ids) > 0) {
+          word_matrix[i, mesh_ids] <- 1
+          cat(length(mesh_ids), " IDs for PMIDs of row", i, " were added", "\n")
+          # add data in dictionary
+          dict_table <- rbind(dict_table, table)
+          dict_table <- dict_table[!duplicated(as.character(dict_table[, 2])), ]
+        }
+      } else {
+        terms <- as.character(table[, 1])
+        if (length(terms) > 0) {
+          word_matrix[i, terms] <- 1
+          cat(length(terms), " terms for PMIDs of row", i, " were added.", "\n")
+        }
+      }
+    }
+  } else {
+    cat("No terms for PMIDs of row", i, " were found.", "\n")
+  }
+}
+
+if (args$byid) {
+  #change column names of matrix: exchange mesh ids/ids with term
+  index_names <- match(names(word_matrix), as.character(dict_table[[2]]))
+  names(word_matrix) <- dict_table[index_names, 1]
+}
+
+colnames(word_matrix) <- gsub("[^[:print:]]", "", colnames(word_matrix))
+colnames(word_matrix) <- gsub('\"', "", colnames(word_matrix), fixed = TRUE)
+
+#merge duplicated columns
+word_matrix <- as.data.frame(do.call(cbind, by(t(word_matrix), INDICES = names(word_matrix), FUN = colSums)))
+
+#save binary matrix
+word_matrix <- as.matrix(word_matrix)
+word_matrix[is.na(word_matrix)] <- 0
+cat("Matrix with ", nrow(word_matrix), " rows and ", ncol(word_matrix), " columns generated.", "\n")
+write.table(word_matrix, args$output, row.names = FALSE, sep = "\t", quote = FALSE)
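
Two steps of the script above lend themselves to a small offline illustration: splitting PMIDs into batches of at most 100 (as `get_pubtator_terms` does with `pubtator_max_ids`), and turning tab-separated annotation lines into a six-column table (as `merge_pubtator_table` does). The sketch below uses placeholder IDs and a made-up response string in the layout the script expects; the offsets and identifiers are toy values, not real PubTator output.

```r
# Batching: split IDs into chunks of at most `max_ids`.
pmids <- as.character(1:250)  # placeholder IDs, not real PMIDs
max_ids <- 100
batches <- split(pmids, ceiling(seq_along(pmids) / max_ids))
print(lengths(batches))

# Parsing: a made-up response with two header lines followed by annotation
# lines (PMID, start, end, mention, category, identifier), tab-separated.
response <- paste(
  "1|t|Toy title",
  "1|a|Toy abstract",
  "1\t0\t4\tEGFR\tGene\t1956",
  "1\t10\t14\tTP53\tGene",  # identifier missing: padded with NA below
  sep = "\n"
)
lines <- strsplit(response, "\n", fixed = TRUE)[[1]]
fields <- strsplit(lines, "\t", fixed = TRUE)

# Keep only lines with 5 or 6 fields; header lines contain no tabs and drop out.
fields <- Filter(function(f) length(f) %in% c(5, 6), fields)

# Pad 5-field lines with NA so that every row has exactly 6 columns.
annotations <- do.call(rbind, lapply(fields, function(f) c(f, NA)[1:6]))
print(annotations)
```

Building the table with a single `do.call(rbind, ...)` avoids growing it row by row inside a loop, which matters little for toy data but can for long responses.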
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/pmids_to_pubtator_matrix.xml	Wed Mar 24 08:33:56 2021 +0000
@@ -0,0 +1,109 @@
+  <tool id="pmids_to_pubtator_matrix" name="PMIDs to PubTator" version="@VERSION@" license="MIT">
+    <description>binary matrix</description>
+    <macros>
+        <import>macros.xml</import>
+    </macros>
+    <requirements>
+        <requirement type="package" version="2.0.3">r-argparse</requirement>
+        <requirement type="package" version="1.4.0">r-stringr</requirement>
+        <requirement type="package" version="1.98_1.2">r-rcurl</requirement>
+        <requirement type="package" version="1.5.3">r-stringi</requirement>
+    </requirements>
+    <command detect_errors="exit_code"><![CDATA[
+    Rscript 
+      '${__tool_directory__}/pmids_to_pubtator_matrix.R'
+      --input '$input'
+      --output '$output'
+      --number '$number'
+      $byid
+      --categories 
+      #for $category in $categories:
+        '$category'
+      #end for
+      ]]>
+    </command>
+    <inputs>
+        <param argument="--input" type="data" format="tabular" label="Input file with PMIDs" />
+        <param argument="--categories" type="select" label="categories" multiple="true" display="checkboxes">
+            <option value="Gene">Genes</option>
+            <option value="Disease">Diseases</option>
+            <option value="Mutation">Mutations</option>
+            <option value="Chemical">Chemicals</option>
+            <option value="Species">Species</option>
+        </param>
+        <param argument="--byid" label="If you want to find common gene IDs / mesh IDs instead of specific scientific terms." name="byid" type="boolean" truevalue="--byid" falsevalue="" help="byid" checked="false"/>
+        <param argument="--number" label="Number of most frequent terms/IDs to extract." name="number" optional="true" type="integer" help="number" value="50"/>
+    </inputs>
+    <outputs>
+        <data format="tabular" name="output" />
+    </outputs>
+    <tests>
+        <test>
+            <param name="input" value="pubmed_by_queries_output" ftype="tabular"/>
+            <param name="categories" value="Gene,Mutation"/>
+            <output name="output">
+                <assert_contents>
+                    <has_n_lines n="7"/>
+                </assert_contents>
+            </output>
+        </test>
+        <test>
+            <param name="input" value="pubmed_by_queries_output" ftype="tabular"/>
+            <param name="categories" value="Gene,Disease"/>
+            <param name="byid" value="True"/>
+            <output name="output">
+                <assert_contents>
+                    <has_n_lines n="7"/>
+                </assert_contents>
+            </output>
+        </test>
+        <test>
+            <param name="input" value="pubmed_by_queries_output" ftype="tabular"/>
+            <param name="categories" value="Gene,Disease"/>
+            <param name="number" value="5"/>
+            <output name="output">
+                <assert_contents>
+                    <has_n_lines n="7"/>
+                </assert_contents>
+            </output>
+        </test>
+    </tests>
+    <help><![CDATA[
+
+**What it does**
+
+The tool uses all PMIDs per row and extracts "Gene", "Disease", "Mutation", "Chemical" and "Species" terms of the corresponding abstracts, 
+using PubTator annotations. The user can choose from which categories terms should be extracted. The extracted terms are united in one
+large binary matrix, with 0= term not present in abstracts of that row and 1= term present in abstracts of that row.
+The user can decide if the scientific terms should be extracted and used as they are or if they should be grouped by their
+geneIDs/ meshIDs (several terms are often grouped into one ID). The user can also specify a number of most frequent words to extract per row.
+
+- Input file:
+
+    Output of 'abstracts_by_pmids' tool, or tab-delimited table with columns containing PMIDs. 
+    The names of the PMID columns should start with "PMID", e.g. "PMID_1", "PMID_2" etc.
+
+- Output file: 
+
+    Binary matrix in which each column represents one of the extracted terms.
+
+-----
+
+**Example**
+
+- Input table:
+
+    | PMID_1      | PMID_2      | PMID_3    
+    | 33565071    | 33531663    | 33528079  
+    | 33377604    | 33334860    | 33277917
+
+- Extract of output table:
+
+    | egfr        | hormone     | tp53        | scn8a       | cacna1a     | grin2a      
+    | 1           | 0           | 1           | 0           | 1           | 0           
+    | 1           | 1           | 1           | 1           | 0           | 1           
+
+
+        ]]></help>
+    <expand macro="citations"/>
+</tool>
\ No newline at end of file
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/pubmed_by_queries.R	Wed Mar 24 08:33:56 2021 +0000
@@ -0,0 +1,258 @@
+#!/usr/bin/env Rscript
+#tool: pubmed_by_queries
+#
+#This tool uses a set of search queries to download a defined number of abstracts or
+#PMIDs for each search query from PubMed. PubMed's search rules and syntax apply.
+#
+#Input: Tab-delimited table with search queries in a column starting with "ID_",
+#e.g. "ID_gene" if search queries are genes.
+#
+#Output: Input table with additional columns
+#with PMIDs or abstracts (--abstract) from PubMed.
+#
+#Usage:
+#$pubmed_by_queries.R [-h] [-i INPUT] [-o OUTPUT] [-n NUMBER] [-a] [-k KEY]
+#
+#optional arguments:
+# -h, --help                  show this help message and exit
+# -i INPUT, --input INPUT     input file name. add path if file is not in working directory
+# -o OUTPUT, --output OUTPUT  output file name. [default "pubmed_by_queries_output"]
+# -n NUMBER, --number NUMBER  number of PMIDs or abstracts to save per ID [default "5"]
+# -a, --abstract              if abstracts instead of PMIDs should be retrieved, use --abstract
+# -k KEY, --key KEY           if ncbi API key is available, add it to speed up the download of PubMed data.
+# For usage in Galaxy add the API key to the Galaxy user-preferences (User/ Preferences/ Manage Information).
+
+if ("--install_packages" %in% commandArgs()) {
+  print("Installing packages")
+  if (!require("argparse")) install.packages("argparse", repo = "http://cran.rstudio.com/") ;
+  if (!require("easyPubMed")) install.packages("easyPubMed", repo = "http://cran.rstudio.com/") ;
+}
+
+suppressPackageStartupMessages(library("argparse"))
+suppressPackageStartupMessages(library("easyPubMed"))
+
+parser <- ArgumentParser()
+parser$add_argument("-i", "--input",
+                    help = "Input file name. Add path if file is not in working directory.")
+parser$add_argument("-o", "--output", default = "pubmed_by_queries_output",
+                    help = "Output file name. [default \"%(default)s\"]")
+parser$add_argument("-n", "--number", type = "integer", default = 5,
+                    help = "Number of PMIDs (or abstracts) to save per ID. [default \"%(default)s\"]")
+parser$add_argument("-a", "--abstract", action = "store_true", default = FALSE,
+                    help = "Retrieve abstracts instead of PMIDs.")
+parser$add_argument("-k", "--key", type = "character",
+                    help = "If an NCBI API key is available, add it to speed up the download of PubMed data. For usage in Galaxy add the API key to the Galaxy user-preferences (User/ Preferences/ Manage Information).")
+parser$add_argument("--install_packages", action = "store_true", default = FALSE,
+                    help = "If you want to auto install missing required packages.")
+args <- parser$parse_args()
+
+if (!is.null(args$key)) {
+  if (file.exists(args$key)) {
+    credentials <- read.table(args$key, quote = "\"", comment.char = "")
+    args$key <- credentials[1, 1]
+  }
+}
+
+max_web_tries <- 100
+
+data <- read.delim(args$input, stringsAsFactors = FALSE)
+
+id_col_index <- grep("ID_", names(data))
+
+
+fetch_pmids <- function(data, number, pubmed_search, query, row, max_web_tries) {
+  my_pubmed_url <- paste("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?",
+                         "db=pubmed&retmax=", number,
+                         "&term=", pubmed_search$OriginalQuery,
+                         "&usehistory=n", sep = "")
+  # get ids
+  idxml <- c()
+  for (i in seq(max_web_tries)) {
+    tryCatch({
+      id_connect <- suppressWarnings(url(my_pubmed_url, open = "rb", encoding = "UTF8"))
+      idxml <- suppressWarnings(readLines(id_connect, warn = FALSE, encoding = "UTF8"))
+      suppressWarnings(close(id_connect))
+      break
+    }, error = function(e) {
+      print(paste("Error getting URL, sleeping", 2 * i, "seconds."))
+      print(e)
+      Sys.sleep(time = 2 * i)
+    })
+  }
+  pmids <- c()
+  for (i in seq_along(idxml)) {
+    if (grepl("^<Id>", idxml[i])) {
+      pmid <- custom_grep(idxml[i], tag = "Id", format = "char")
+      pmids <- c(pmids, as.character(pmid[1]))
+    }
+  }
+  if (length(pmids) > 0) {
+    data[row, sapply(seq(length(pmids)), function(i) {
+      paste0("PMID_", i)
+    })] <- pmids
+    cat(length(pmids), " PMIDs for ", query, " are added in the table.",  "\n")
+  }
+  return(data)
+}
+
+
+fetch_abstracts <- function(data, number, query, pubmed_search) {
+  efetch_url <- paste("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?",
+                      "db=pubmed&WebEnv=", pubmed_search$WebEnv, "&query_key=", pubmed_search$QueryKey,
+                      "&retstart=", 0, "&retmax=", number,
+                      "&rettype=", "null", "&retmode=", "xml", sep = "")
+  api_key <- pubmed_search$APIkey
+  if (!is.null(api_key)) {
+    efetch_url <- paste(efetch_url, "&api_key=", api_key, sep = "")
+  }
+  # initialize
+  out_data <- NULL
+  try_num <- 1
+  t_0 <- Sys.time()
+  # Try to fetch results
+  while (is.null(out_data)) {
+    # Timing check: kill at 3 min
+    if (try_num > 1) {
+      Sys.sleep(time = 2 * try_num)
+      cat("Problem receiving PubMed data, or an error was returned. Please wait. Try number:",
+          try_num, "\n")
+    }
+    t_1 <- Sys.time()
+    if (as.numeric(difftime(t_1, t_0, units = "mins")) > 3) {
+      message("Killing the request! Something is not working. Please, try again later",
+              "\n")
+      return(data)
+    }
+    # ENTREZ server connect
+    out_data <- tryCatch({
+      tmp_connect <- suppressWarnings(url(efetch_url,
+                                          open = "rb",
+                                          encoding = "UTF8"))
+      suppressWarnings(readLines(tmp_connect, warn = FALSE, encoding = "UTF8"))
+    }, error = function(e) {
+      print(e)
+      # return NULL so that the enclosing while loop retries the request
+      NULL
+    }, finally = {
+      try(suppressWarnings(close(tmp_connect)),
+          silent = TRUE)
+    })
+    # Check if error
+    if (!is.null(out_data) &&
+        is.character(out_data) &&
+        grepl("<ERROR>", substr(paste(utils::head(out_data, n = 100),
+                                      collapse = ""), 1, 250))) {
+      out_data <- NULL
+    }
+    try_num <- try_num + 1
+  }
+  if (is.null(out_data)) {
+    message("Killing the request! Something is not working. Please, try again later",
+            "\n")
+    return(data)
+  } else {
+    return(out_data)
+  }
+}
+
+
+process_xml_abstracts <- function(out_data) {
+  xml_data <- paste(out_data, collapse = "")
+  # articles to list
+  xml_data <- strsplit(xml_data, "<PubmedArticle(>|[[:space:]]+?.*>)")[[1]][-1]
+  xml_data <- sapply(xml_data, function(x) {
+    #trim extra stuff at the end of the record
+    if (!grepl("</PubmedArticle>$", x))
+      x <- sub("(^.*</PubmedArticle>).*$", "\\1", x)
+    # Rebuild XML structure and proceed
+    x <- paste("<PubmedArticle>", x)
+    gsub("[[:space:]]{2,}", " ", x)
+  },
+  USE.NAMES = FALSE, simplify = TRUE)
+  #titles
+  titles <- sapply(xml_data, function(x) {
+    x <- custom_grep(x, tag = "ArticleTitle", format = "char")
+    x <- gsub("</{0,1}i>", "", x, ignore.case = TRUE)
+    x <- gsub("</{0,1}b>", "", x, ignore.case = TRUE)
+    x <- gsub("</{0,1}sub>", "", x, ignore.case = TRUE)
+    x <- gsub("</{0,1}exp>", "", x, ignore.case = TRUE)
+    if (length(x) > 1) {
+      x <- paste(x, collapse = " ", sep = " ")
+    } else if (length(x) < 1) {
+      x <- NA
+    }
+    x
+  },
+  USE.NAMES = FALSE, simplify = TRUE)
+  # abstracts
+  abstract_text <- sapply(xml_data, function(x) {
+    custom_grep(x, tag = "AbstractText", format = "char")
+  },
+  USE.NAMES = FALSE, simplify = TRUE)
+  abstracts <- sapply(abstract_text, function(x) {
+    if (length(x) > 1) {
+      x <- paste(x, collapse = " ", sep = " ")
+      x <- gsub("</{0,1}i>", "", x, ignore.case = TRUE)
+      x <- gsub("</{0,1}b>", "", x, ignore.case = TRUE)
+      x <- gsub("</{0,1}sub>", "", x, ignore.case = TRUE)
+      x <- gsub("</{0,1}exp>", "", x, ignore.case = TRUE)
+    } else if (length(x) < 1) {
+      x <- NA
+    } else {
+      x <- gsub("</{0,1}i>", "", x, ignore.case = TRUE)
+      x <- gsub("</{0,1}b>", "", x, ignore.case = TRUE)
+      x <- gsub("</{0,1}sub>", "", x, ignore.case = TRUE)
+      x <- gsub("</{0,1}exp>", "", x, ignore.case = TRUE)
+    }
+    x
+  },
+  USE.NAMES = FALSE, simplify = TRUE)
+  #add title to abstracts
+  if (length(titles) == length(abstracts)) {
+    abstracts <- paste(titles,  abstracts)
+  }
+  return(abstracts)
+}
+
+
+pubmed_data_in_table <- function(data, row, query, number, key, abstract) {
+  if (is.null(query)) {
+    print(data)
+  }
+  pubmed_search <- get_pubmed_ids(query, api_key = key)
+  if (as.numeric(pubmed_search$Count) == 0) {
+    cat("No PubMed result for the following query: ", query, "\n")
+    return(data)
+  } else if (abstract == FALSE) { # fetch PMIDs
+    data <- fetch_pmids(data, number, pubmed_search, query, row, max_web_tries)
+    return(data)
+  } else if (abstract == TRUE) { # fetch abstracts and title text
+    out_data <- fetch_abstracts(data, number, query, pubmed_search)
+    abstracts <- process_xml_abstracts(out_data)
+    #add abstracts to data frame
+    if (length(abstracts) > 0) {
+      data[row, sapply(seq(length(abstracts)),
+                       function(i) {
+                         paste0("ABSTRACT_", i)
+                       })] <- abstracts
+      cat(length(abstracts), " abstracts for ", query, " are added in the table.",
+          "\n")
+    }
+    return(data)
+  }
+}
+
+for (i in seq_len(nrow(data))) {
+  data <- tryCatch(pubmed_data_in_table(data = data, row = i,
+                                        query = data[i, id_col_index],
+                                        number = args$number,
+                                        key = args$key,
+                                        abstract = args$abstract), error = function(e) {
+                                          print("main error")
+                                          print(e)
+                                          Sys.sleep(5)
+                                          data  # keep the table unchanged so later queries still run
+                                        })
+}
+
+write.table(data, args$output, append = FALSE, sep = "\t", row.names = FALSE, col.names = TRUE, quote = FALSE)
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/abstracts_by_pmids_output	Wed Mar 24 08:33:56 2021 +0000
@@ -0,0 +1,7 @@
+ID_gene	GROUPING_disease	PMID_1	PMID_2	PMID_3	PMID_4	PMID_5	ABSTRACT_1	ABSTRACT_2	ABSTRACT_3	ABSTRACT_4	ABSTRACT_5
+SCN1A	epilepsy	33565071	33531663	33528079	33519675	33478845	To analyze the clinical features and genetic variants in two patients with Dravet syndrome (DS). Peripheral blood samples of the children and their parents were collected for the extraction of genomic DNA and high-throughput sequencing. Suspected variants were confirmed by Sanger sequencing. By high-throughput sequencing, the two children were found to respectively harbor a c.2135delC frameshifting variant in exon 12 and a c.1522G&gt;T nonsense variant in exon 10 of the SCN1A gene. Both variants were predicted to be pathogenic by bioinformatic analysis. Based on the American College of Medical Genetics and Genomics standards and guidelines, the c.2135delC and c.1522G&gt;A variants of the SCN1A gene were predicted to be pathogenic (PVS1+ PS2+ PM2+ PP3). The variants of the SCN1A gene probably underlay the DS in the patients. Above finding has enriched the variant spectrum and enabled genetic counseling for their families.	The voltage-gated sodium channel α-subunit genes comprise a highly conserved gene family. Mutations of three of these genes, SCN1A, SCN2A and SCN8A, are responsible for a significant burden of neurological disease. Recent progress in identification and functional characterization of patient variants is generating new insights and novel approaches to therapy for these devastating disorders. Here we review the basic elements of sodium channel function that are used to characterize patient variants. We summarize a large body of work using global and conditional mouse mutants to characterize the in vivo roles of these channels. We provide an overview of the neurological disorders associated with mutations of the human genes and examples of the effects of patient mutations on channel function. Finally, we highlight therapeutic interventions that are emerging from new insights into mechanisms of sodium channelopathies.	
Advancement in genetic technology has led to the identification of an increasing number of genes in epilepsy. This will provide a huge information in clinical practice and improve diagnosis and treatment of epilepsy. this was a single-center retrospective cohort study of 80 patients who underwent NGS testing with customize epilepsy panel. In total 54 out of 80 patients (67, 5%), pathogenic / likely pathogenic and variants of uncertain significance variants were identified according to ACMG criteria. Pathogenic or likely pathogenic variants (n=35) were identified in 29 out of 80 individuals (36.25%). Variants of uncertain significance (VOUS) (n=34) have identified in 28 out of 80 patients (35%). Pathogenic, likely pathogenic, and variants of uncertain significance (VOUS) were most frequently identified in TSC2 (n = 11), SCN1A (n = 6) and TSC1 (n = 5) genes. Other common genes were KCNQ2 (n = 3), AMT (n = 3), CACNA1H (n = 3), CLCN2 (n = 3), MECP2 (n = 2), ASAH1 (n = 2) and SLC2A1 (n = 2). NGS based testing panels contributes the diagnosis of epilepsy and may change the clinical management by preventing unnecessary and potentially harmful diagnostic procedures and management in patients. Thus, our results highlighted the benefit of genetic testing in children suffered with epilepsy. This article is protected by copyright. All rights reserved.	Background:SCN1A and SCN2A genes have been reported to be associated with the efficacy of single and combined antiepileptic therapy, but the results remain contradictory. Previous meta-analyses on this topic mainly focused on the SCN1A rs3812718 polymorphism. However, meta-analyses focused on SCN1A rs2298771, SCN1A rs10188577, SCN2A rs17183814, or SCN2A rs2304016 polymorphisms are scarce or non-existent. Objective: We aimed to conduct a meta-analysis to determine the effects of SCN1A rs2298771, SCN1A rs10188577, SCN2A rs17183814, and SCN2A rs2304016 polymorphisms on resistance to antiepileptic drugs (AEDs). 
Methods: We searched the PubMed, Embase, Cochrane Library, WANFANG, and CNKI databases up to June 2020 to collect studies on the association of SCN1A and SCN2A polymorphisms with reactivity to AEDs. We calculated the pooled odds ratios (ORs) under the allelic, homozygous, heterozygous, dominant, and recessive genetic models to identify the association between the four single-nucleotide polymorphisms (SNPs) and resistance to AEDs. Results: Our meta-analysis included 19 eligible studies. The results showed that the SCN1A rs2298771 polymorphism was related to AED resistance in the allelic, homozygous, and recessive genetic models (G vs. A: OR = 1.20, 95% CI: 1.012-1.424; GG vs. AA: OR = 1.567, 95% CI: 1.147-2.142; GG vs. AA + AG: OR = 1.408, 95% CI: 1.053-1.882). The homozygous model remained significant after Bonferroni correction (P &lt; 0.0125). Further subgroup analyses demonstrated the significance of the correlation in the dominant model in Caucasians (South Asians) after Bonferroni correction (GG + GA vs. AA: OR = 1.620, 95% CI: 1.165-2.252). However, no association between SCN1A rs2298771 polymorphism and resistance to AEDs was found in Asians or Caucasians (non-South Asians). For SCN1A rs10188577, SCN2A rs17183814, and SCN2A rs2304016 polymorphisms, the correlations with responsiveness to AEDs were not significant in the overall population nor in any subgroup after conducting the Bonferroni correction. The results for SCN1A rs2298771, SCN1A rs10188577, and SCN2A rs2304016 polymorphisms were stable and reliable according to sensitivity analysis and Begg and Egger tests. However, the results for SCN2A rs17183814 polymorphism have to be treated cautiously owing to the significant publication bias revealed by Begg and Egger tests. Conclusions: The present meta-analysis indicated that SCN1A rs2298771 polymorphism significantly affects resistance to AEDs in the overall population and Caucasians (South Asians). 
There were no significant correlations between SCN1A rs10188577, SCN2A rs17183814, and SCN2A rs2304016 polymorphisms and resistance to AEDs.	The objective of this study was to identify developmental trajectories of developmental/behavioral phenotypes and possibly their relationship to epilepsy and genotype by analyzing developmental and behavioral features collected prospectively and longitudinally in a cohort of patients with Dravet syndrome (DS). Thirty-four patients from seven Italian tertiary pediatric neurology centers were enrolled in the study. All patients were examined for the SCN1A gene mutation and prospectively assessed from the first years of life with repeated full clinical observations including neurological and developmental examinations. Subjects were found to follow three neurodevelopmental trajectories. In the first group (16 patients), an initial and usually mild decline was observed between the second and the third year of life, specifically concerning visuomotor abilities, later progressing towards global involvement of all abilities. The second group (12 patients) showed an earlier onset of global developmental impairment, progressing towards a generally worse outcome. The third group of only two patients ended up with a normal neurodevelopmental quotient, but with behavioral and linguistic problems. The remaining four patients were not classifiable due to a lack of critical assessments just before developmental decline. The neurodevelopmental trajectories described in this study suggest a differential contribution of neurobiological and genetic factors. The profile of the first group, which included the largest fraction of patients, suggests that in the initial phase of the disease, visuomotor defects might play a major role in determining developmental decline. Early diagnosis of milder cases with initial visuomotor impairment may therefore provide new tools for a more accurate habilitation strategy.
+SCN9A	epilepsy	33389681	33370834	33278787	33237934	33232657	Dorsal root ganglia (DRG) sensory neurons can transmit information about noxious stimulus to cerebral cortex via spinal cord, and play an important role in the pain pathway. Alterations of the pain pathway lead to CIPA (congenital insensitivity to pain with anhidrosis) or chronic pain. Accumulating evidence demonstrates that nerve damage leads to the regeneration of neurons in DRG, which may contribute to pain modulation in feedback. Therefore, exploring the regeneration process of DRG neurons would provide a new understanding to the persistent pathological stimulation and contribute to reshape the somatosensory function. It has been reported that a subpopulation of satellite glial cells (SGCs) express Nestin and p75, and could differentiate into glial cells and neurons, suggesting that SGCs may have differentiation plasticity. Our results in the present study show that DRG-derived SGCs (DRG-SGCs) highly express neural crest cell markers Nestin, Sox2, Sox10, and p75, and differentiate into nociceptive sensory neurons in the presence of histone deacetylase inhibitor VPA, Wnt pathway activator CHIR99021, Notch pathway inhibitor RO4929097, and FGF pathway inhibitor SU5402. The nociceptive sensory neurons express multiple functionally-related genes (SCN9A, SCN10A, SP, Trpv1, and TrpA1) and are able to generate action potentials and voltage-gated Na<sup>+</sup> currents. Moreover, we found that these cells exhibited rapid calcium transients in response to capsaicin through binding to the Trpv1 vanilloid receptor, confirming that the DRG-SGC-derived cells are nociceptive sensory neurons. Further, we show that Wnt signaling promotes the differentiation of DRG-SGCs into nociceptive sensory neurons by regulating the expression of specific transcription factor Runx1, while Notch and FGF signaling pathways are involved in the expression of SCN9A. 
These results demonstrate that DRG-SGCs have stem cell characteristics and can efficiently differentiate into functional nociceptive sensory neurons, shedding light on the clinical treatment of sensory neuron-related diseases.	Voltage-gated sodium channel Nav1.7 has been validated as a perspective target for selective inhibitors with analgesic and anti-itch activity. The objective of this study was to discover new candidate compounds with Nav1.7 inhibitor properties. The authors hypothesized that their approach would yield at least one new compound that inhibits sodium currents in vitro and exerts analgesic and anti-itch effects in mice. In silico structure-based similarity search of 1.5 million compounds followed by docking to the Nav1.7 voltage sensor of Domain 4 and molecular dynamics simulation was performed. Patch clamp experiments in Nav1.7-expressing human embryonic kidney 293 cells and in mouse and human dorsal root ganglion neurons were conducted to test sodium current inhibition. Formalin-induced inflammatory pain model, paclitaxel-induced neuropathic pain model, histamine-induced itch model, and mouse lymphoma model of chronic itch were used to confirm in vivo activity of the selected compound. After in silico screening, nine compounds were selected for experimental assessment in vitro. Of those, four compounds inhibited sodium currents in Nav1.7-expressing human embryonic kidney 293 cells by 29% or greater (P &lt; 0.05). Compound 9 (3-(1-benzyl-1H-indol-3-yl)-3-(3-phenoxyphenyl)-N-(2-(pyrrolidin-1-yl)ethyl)propanamide, referred to as DA-0218) reduced sodium current by 80% with a 50% inhibition concentration of 0.74 μM (95% CI, 0.35 to 1.56 μM), but had no effects on Nav1.5-expressing human embryonic kidney 293 cells. In mouse and human dorsal root ganglion neurons, DA-0218 reduced sodium currents by 17% (95% CI, 6 to 28%) and 22% (95% CI, 9 to 35%), respectively. The inhibition was greatly potentiated in paclitaxel-treated mouse neurons. 
Intraperitoneal and intrathecal administration of the compound reduced formalin-induced phase II inflammatory pain behavior in mice by 76% (95% CI, 48 to 100%) and 80% (95% CI, 68 to 92%), respectively. Intrathecal administration of DA-0218 produced acute reduction in paclitaxel-induced mechanical allodynia, and inhibited histamine-induced acute itch and lymphoma-induced chronic itch. This study's computer-aided drug discovery approach yielded a new Nav1.7 inhibitor that shows analgesic and anti-pruritic activity in mouse models.	This study aimed to investigate the genetic aetiology in Chinese children diagnosed with status epilepticus (SE). Next-generation sequencing, copy number variation (CNV) analysis, and other genetic testing methods were conducted for children with SE lacking an identifiable non-genetic aetiology. Furthermore, the phenotype and molecular data of patients with SE were retrospectively analysed. Among children with SE lacking an identifiable non-genetic aetiology, 73 out of 163 children (44.8 %) were found to have causative variants associated with SE including 66 monogenic mutations in 22 genes and 7 CNVs. Based on the American College of Medical Genetics and Genomics scoring system, the monogenic variants included 64 pathogenic/likely pathogenic and 2 uncertain significance variants. SCN1A gene mutations (n = 32) were the most common cause, followed by TSC2 (n = 5), CACNA1A (n = 5), SCN2A (n = 4), SCN9A (n = 2) and DEPDC5 (n = 2) gene mutations. Sixteen mutations were identified in single genes. Furthermore, 51 (77.3 %) monogenic mutations were de novo. Age at SE onset &lt; 1 year (odds ratio [OR] = 2.70, 95 % confidence interval [CI]: 1.25-5.83, p = 0.012) and co-morbidity of intellectual disability (OR = 3.36, 95 %CI: 1.61-6.99, p = 0.001) were independently associated with pathogenic genetic variants. 
This study identified genetic aetiology in 44.8 % of patients with SE, which indicates a high burden of genetic aetiology among children with SE in China. Our findings highlight the importance for genetic testing of children with SE that lacks an identifiable non-genetic aetiology.	Glioblastoma (GBM) is an aggressive brain tumor associated with high degree of resistance to treatment. Given its heterogeneity, it is important to understand the molecular landscape of this tumor for the development of more effective therapies. Because of the different genetic profiles of patients with GBM, we sought to identify genetic variants in Lebanese patients with GBM (LEB-GBM) and compare our findings to those in the Cancer Genome Atlas (TCGA). We performed whole exome sequencing (WES) to identify somatic variants in a cohort of 60 patient-derived GBM samples. We focused our analysis on 50 commonly mutated GBM candidate genes and compared mutation signatures between our population and publicly available GBM data from TCGA. We also cross-tabulated biological covariates to assess for associations with overall survival, time to recurrence and follow-up duration. We included 60 patient-derived GBM samples from 37 males and 23 females, with age ranging from 3 to 80 years (mean and median age at diagnosis were 51 and 56, respectively). Recurrent tumor formation was present in 94.8% of patients (n = 55/58). After filtering, we identified 360 somatic variants from 60 GBM patient samples. After filtering, we identified 360 somatic variants from 60 GBM patient samples. Most frequently mutated genes in our samples included ATRX, PCDHX11, PTEN, TP53, NF1, EGFR, PIK3CA, and SCN9A. Mutations in NLRP5 were associated with decreased overall survival among the Lebanese GBM cohort (p = 0.002). Mutations in NLRP5 were associated with decreased overall survival among the Lebanese GBM cohort (p = 0.002). 
EGFR and NF1 mutations were associated with the frontal lobe and temporal lobe in our LEB-GBM cohort, respectively. Our WES analysis confirmed the similarity in mutation signature of the LEB-GBM population with TCGA cohorts. It showed that 1 out of the 50 commonly GBM candidate gene mutations is associated with decreased overall survival among the Lebanese cohort. This study also highlights the need for studies with larger sample sizes to inform clinicians for better prognostication and management of Lebanese patients with GBM.	Voltage-gated sodium channels initiate electrical signals and are frequently targeted by deadly gating-modifier neurotoxins, including tarantula toxins, which trap the voltage sensor in its resting state. The structural basis for tarantula-toxin action remains elusive because of the difficulty of capturing the functionally relevant form of the toxin-channel complex. Here, we engineered the model sodium channel NaVAb with voltage-shifting mutations and the toxin-binding site of human NaV1.7, an attractive pain target. This mutant chimera enabled us to determine the cryoelectron microscopy (cryo-EM) structure of the channel functionally arrested by tarantula toxin. Our structure reveals a high-affinity resting-state-specific toxin-channel interaction between a key lysine residue that serves as a &quot;stinger&quot; and penetrates a triad of carboxyl groups in the S3-S4 linker of the voltage sensor. By unveiling this high-affinity binding mode, our studies establish a high-resolution channel-docking and resting-state locking mechanism for huwentoxin-IV and provide guidance for developing future resting-state-targeted analgesic drugs.
+GRIN2A	epilepsy	33531473	33499151	33457012	33420383	33370585	The effects of different forms of monosaccharides on the brain remain unclear, though neuropsychiatric disorders undergo changes in glucose metabolism. This study assessed cell viability responses to five commonly consumed monosaccharides-D-ribose (RIB), D-glucose, D-mannose (MAN), D-xylose and L-arabinose-in cultured neuro-2a cells. Markedly decreased cell viability was observed in cells treated with RIB and MAN. We then showed that high-dose administration of RIB induced depressive- and anxiety-like behavior as well as spatial memory impairment in mice, while high-dose administration of MAN induced anxiety-like behavior and spatial memory impairment only. Moreover, significant pathological changes were observed in the hippocampus of high-dose RIB-treated mice by hematoxylin-eosin staining. Association analysis of the metabolome and transcriptome suggested that the anxiety-like behavior and spatial memory impairment induced by RIB and MAN may be attributed to the changes in four metabolites and 81 genes in the hippocampus, which is involved in amino acid metabolism and serotonin transport. In addition, combined with previous genome-wide association studies on depression, a correlation was found between the levels of Tnni3k and Tbx1 in the hippocampus and RIB induced depressive-like behavior. Finally, metabolite-gene network, qRT-PCR and western blot analysis showed that the insulin-POMC-MEK-TCF7L2 and MAPK-CREB-GRIN2A-CaMKII signaling pathways were respectively associated with RIB and MAN induced depressive/anxiety-like behavior and spatial memory impairment. Our findings clarified our understanding of the biological mechanisms underlying RIB and MAN induced depressive/anxiety-like behavior and spatial memory impairment in mice and highlighted the deleterious effects of high-dose RIB and MAN as long-term energy sources.	
Elite rugby league and union have some of the highest reported rates of concussion (mild traumatic brain injury) in professional sport due in part to their full-contact high-velocity collision-based nature. Currently, concussions are the most commonly reported match injury during the tackle for both the ball carrier and the tackler (8-28 concussions per 1000 player match hours) and reports exist of reduced cognitive function and long-term health consequences that can end a playing career and produce continued ill health. Concussion is a complex phenotype, influenced by environmental factors and an individual's genetic predisposition. This article reviews concussion incidence within elite rugby and addresses the biomechanics and pathophysiology of concussion and how genetic predisposition may influence incidence, severity and outcome. Associations have been reported between a variety of genetic variants and traumatic brain injury. However, little effort has been devoted to the study of genetic associations with concussion within elite rugby players. Due to a growing understanding of the molecular characteristics underpinning the pathophysiology of concussion, investigating genetic variation within elite rugby is a viable and worthy proposition. Therefore, we propose from this review that several genetic variants within or near candidate genes of interest, namely APOE, MAPT, IL6R, COMT, SLC6A4, 5-HTTLPR, DRD2, DRD4, ANKK1, BDNF and GRIN2A, warrant further study within elite rugby and other sports involving high-velocity collisions.	Advanced gastric signet-ring cell carcinoma (SRCC) is a specific type of malignant gastric cancer (GC) with distinct poorer survival. Claudin18.2 (CLDN18.2) is a promising neo-biomarker for the treatment of GC. Clinical trials of CLDN18.2-targeted antibody and T cell-based immunotherapy providing promising prospects for the treatment of GC. 
The effect of antibody therapy depended on the expression rate of CLDN18.2 has been found in clinical trials. This study aimed to determine the prevalence and the therapeutic value of CLDN18.2 in advanced gastric SRCC. Expression of CLDN18.2 in 105 formalin-fixed, paraffin-embedded (FFPE) tumor tissues was detected by immunohistochemistry (IHC) and evaluated according to FAST criteria. Next-generation sequencing (NGS) using 416 pan-cancer genes panel was performed to characterize the genomic landscape in 61 advanced gastric SRCC patients. Fisher's exact test was used to determine gene differences in different CLDN18.2 expression levels. A total number of 105 advanced gastric SRCC samples were analyzed, of which 95.2% (100/105) were positive stained. Moderate-to-strong CLDN18.2 expression was observed in 64.8% (68/105) of all samples. In particularly, 21.0% (22/105) samples had positive staining in more than 90% tumor cells. No significance was found between CLDN18.2 expression and overall survival (OS). NGS results showed that single nucleotide variations (SNVs) could be frequently found in TP53 (26.2%), CDH1 (19.7%), MED12 (18.0%), PKHD1 (18.0%) and ARID1A (11.5%), besides, copy number variations (CNVs) were rich in NOTCH1 (18.0%) and FLT4 (9.8%) in SRCC samples. Moreover, SNVs in GRIN2A was found in 20% of the patients who had CLDN18.2 staining in &lt;40% of tumor cells (P=0.043), indicating CLDN18.2 expression might be related to the aberration of GRIN2A in advanced gastric SRCC. The highly expressed CLDN18.2 among advanced gastric SRCC patients that we found certified the value of CLDN18.2-targeted therapy in this specific type of GC. In addition, Analyses between CLDN18.2 expression and genetic abnormalities provided novel therapeutic options for advanced gastric SRCC.	
The NMDA receptor-mediated Ca<sup>2+</sup> signaling during simultaneous pre- and postsynaptic activity is critically involved in synaptic plasticity and thus has a key role in the nervous system. In GRIN2-variant patients alterations of this coincidence detection provoked complex clinical phenotypes, ranging from reduced muscle strength to epileptic seizures and intellectual disability. By using our gene-targeted mouse line (Grin2a<sup>N615S</sup>), we show that voltage-independent glutamate-gated signaling of GluN2A-containing NMDA receptors is associated with NMDAR-dependent audiogenic seizures due to hyperexcitable midbrain circuits. In contrast, the NMDAR antagonist MK-801-induced c-Fos expression is reduced in the hippocampus. Likewise, the synchronization of theta- and gamma oscillatory activity is lowered during exploration, demonstrating reduced hippocampal activity. This is associated with exploratory hyperactivity and aberrantly increased and dysregulated levels of attention that can interfere with associative learning, in particular when relevant cues and reward outcomes are disconnected in space and time. Together, our findings provide (i) experimental evidence that the inherent voltage-dependent Ca<sup>2+</sup> signaling of NMDA receptors is essential for maintaining appropriate responses to sensory stimuli and (ii) a mechanistic explanation for the neurological manifestations seen in the NMDAR-related human disorders with GRIN2 variant-meidiated intellectual disability and focal epilepsy.	Evidence suggested the crucial roles of brain-derived neurotrophic factor (BDNF) and glutamate system functioning in the antidepressant mechanisms of low-dose ketamine infusion in treatment-resistant depression (TRD). 65 patients with TRD were genotyped for 684,616 single nucleotide polymorphisms (SNPs). 
Twelve ketamine-related genes were selected for the gene-based genome-wide association study on the antidepressant effect of ketamine infusion and the resulting serum ketamine and norketamine levels. Specific SNPs and whole genes involved in BDNF-TrkB signaling (i.e., rs2049048 in BDNF and rs10217777 in NTRK2) and the glutamatergic and GABAergic systems (i.e., rs16966731 in GRIN2A) were associated with the rapid (within 240 min) and persistent (up to 2 weeks) antidepressant effect of low-dose ketamine infusion and with serum ketamine and norketamine levels. Our findings confirmed the predictive roles of BDNF-TrkB signaling and glutamatergic and GABAergic systems in the underlying mechanisms of low-dose ketamine infusion for TRD treatment.
+ANKRD11	autism	33527450	33476899	33354850	33262785	33179249	To characterize the genetic alterations in adult primary uterine rhabdomyosarcomas (uRMSs) and to investigate whether these tumors are genetically distinct from uterine carcinosarcomas (UCSs). Three tumors originally diagnosed as primary adult pleomorphic uRMS were subjected to massively parallel sequencing targeting 468 cancer-related genes and RNA-sequencing. Mutational profiles were compared to those from UCSs (n=57) obtained from The Cancer Genome Atlas. Sequencing data analyses were performed using validated bioinformatic approaches. Pathogenic TP53 mutations and high levels of genomic instability were detected in the three cases. uRMS1 harbored a likely pathogenic YTHDF2-FOXR1 fusion gene. uRMS2 displayed a PPP2R1A hotspot mutation and amplification of multiple genes, including WHSC1L1, FGFR1, MDM2 and CCNE1, whereas uRMS3 harbored an FBXW7 hotspot mutation and an ANKRD11 homozygous deletion. Hierarchical clustering of somatic mutations and copy number alterations revealed that these tumors initially diagnosed as pleomorphic uRMSs and UCSs were similar. Subsequent comprehensive pathologic re-review of the three uRMSs revealed previously un-identified minute pan-cytokeratin-positive atypical glands in one case (uRMS3), favoring its reclassification as UCS with extensive rhabdomyosarcomatous overgrowth. Adult pleomorphic uRMSs harbor TP53 mutations and high levels of copy number alterations. Our findings underscore the challenge in discriminating between uRMS and UCS with rhabdomyosarcomatous differentiation.	NA	KBG syndrome is a rare genetic disease characterized mainly by skeletal abnormalities, distinctive facial features, and intellectual disability. Heterozygous mutations in ANKRD11 gene, or deletion of 16q24.3 that includes ANKRD11 gene are the cause of KBG syndrome. 
We describe two patients presenting with short stature and partial facial features, whereas no intellectual disability or hearing loss was observed in them. Two ANKRD11 variants, c.4039_4041del (p. Lys1347del) and c.6427C &gt; G (p. Leu2143Val), were identified in this study. Both of them were classified as variants of uncertain significance (VOUS) by ACMG/AMP guidelines and were inherited from their mothers. ANKRD11 could enhance the transactivation of p21 gene, which was identified to participate in chondrogenic differentiation. In this study, we demonstrated that the knockdown of ANKRD11 could reduce the p21-promoter luciferase activities while re-introduction of wild type ANKRD11, but not ANKRD11 variants (p. Lys1347del or p. Leu2143Val), could restore the p21 levels. Thus, our study report two loss-of-function ANKRD11 variants which might provide new insight on pathogenic mechanism that correlates ANKRD11 variants with the short stature phenotype of KBG syndrome.	KBG syndrome (OMIM #148050) is a rare, autosomal dominant inherited genetic disorder caused by heterozygous mutations in the ankyrin repeat domain-containing protein 11 (ANKRD11) gene or by microdeletion of chromosome 16q24.3. It is characterized by macrodontia of the upper central incisors, distinctive facial dysmorphism, short stature, vertebral abnormalities, hand anomaly including clinodactyly, and various degrees of developmental delay. KBG syndrome presents with variable clinical feature and severity among individuals. Here, we report two KBG patients who have different novel heterozygous mutations of ANKRD11 gene with wide range of clinical manifestations. Two novel heterozygous mutations of ANKRD11 gene were identified in two unrelated Korean patients with variable clinical presentations. The first patient presented with short stature and early puberty and was treated with growth hormone and gonadotropin-releasing hormone agonist without adverse effects. He had mild intellectual disability. 
In targeted exome sequencing, a novel de novo frameshift variant was identified in ANKRD11, c.5889del, and p. (Ile1963MetfsX9). The second patient had severe intellectual disability with epilepsy. He had normal height and prepubertal stage at the age of 11 years. He had behavioral problems such as autism-like features, anxiety, and stereotypical movements. Whole exome sequencing (WES) was performed, and the novel heterozygous mutation, c3310dup, p. (Glu110GlyfsTer5) in ANKRD11 was identified. KBG syndrome is often underdiagnosed because of its non-specific features and phenotypic variability. Performing a next-generation sequencing panel, including the ANKRD11 gene for cases of developmental delay with/without short stature may be helpful to identify hitherto undiagnosed KBG syndrome patients.	Neurodevelopmental disorders (NDDs) are a heterogeneous group of conditions including intellectual disability, global developmental delay, autism spectrum disorder, and attention deficit hyperactivity disorder. Advances in genetic diagnostic technology have led to the identification of a number of NDD-associated genes, but reports of cognitive and developmental outcomes in affected individuals have been variable. The objective of this scoping review is to synthesize available information pertaining to the developmental outcomes of individuals with pathogenic variants in ten emerging recurrent NDD-associated genes identified from large scale sequencing studies; ADNP, ANKRD11, ARID1B, CHD2, CHD8, CTNNB1, DDX3X, DYRK1A, SCN2A, and SYNGAP1. After a comprehensive search, 260 articles were selected that reported on neurodevelopmental measures or diagnoses. We identify the spectrum of developmental outcomes for each genetic NDD, including prevalence of intellectual disability, frequency of co-morbid NDDs such as ADHD and autism, and commonly reported medical issues that can help inform diagnosis and treatment. 
There are significant gaps in our understanding of the natural history of these conditions. Future research focusing on barriers to assessment, the development of modified assessment tools appropriate for long-term outcomes in genetic NDD, and collection of longitudinal data will increase understanding of prognosis in these conditions and inform evaluations of treatment.
+SHANK2	autism	33547379	33515293	33491217	33483523	33383702	West Nile virus (WNV) is a Flavivirus, which can cause febrile illness in humans that may progress to encephalitis. Like any other obligate intracellular pathogens, Flaviviruses hijack cellular protein functions as a strategy for sustaining their life cycle. Many cellular proteins display globular domain known as PDZ domain that interacts with PDZ-Binding Motifs (PBM) identified in many viral proteins. Thus, cellular PDZ-containing proteins are common targets during viral infection. The non-structural protein 5 (NS5) from WNV provides both RNA cap methyltransferase and RNA polymerase activities and is involved in viral replication but its interactions with host proteins remain poorly known. In this study, we demonstrate that the C-terminal PBM of WNV NS5 recognizes several human PDZ-containing proteins using both in vitro and in cellulo high-throughput methods. Furthermore, we constructed and assayed in cell culture WNV replicons where the PBM within NS5 was mutated. Our results demonstrate that the PBM of WNV NS5 is important in WNV replication. Moreover, we show that knockdown of the PDZ-containing proteins TJP1, PARD3, ARHGAP21 or SHANK2 results in the decrease of WNV replication in cells. Altogether, our data reveal that interactions between the PBM of NS5 and PDZ-containing proteins affect West Nile virus replication.	Olfaction supports a multitude of behaviors vital for social communication and interactions between conspecifics. Intact sensory processing is contingent upon proper circuit wiring. Disturbances in genetic factors controlling circuit assembly and synaptic wiring can lead to neurodevelopmental disorders, such as autism spectrum disorder (ASD), where impaired social interactions and communication are core symptoms. The variability in behavioral phenotype expression is also contingent upon the role environmental factors play in defining genetic expression. 
Considering the prevailing clinical diagnosis of ASD, research on therapeutic targets for autism is essential. Behavioral impairments may be identified along a range of increasingly complex social tasks. Hence, the assessment of social behavior and communication is progressing towards more ethologically relevant tasks. Garnering a more accurate understanding of social processing deficits in the sensory domain may greatly contribute to the development of therapeutic targets. With that framework, studies have found a viable link between social behaviors, circuit wiring, and altered neuronal coding related to the processing of salient social stimuli. Here, the relationship between social odor processing in rodents and humans is examined in the context of health and ASD, with special consideration for how genetic expression and neuronal connectivity may regulate behavioral phenotypes.	Impairments in social relationships and awareness are features observed in autism spectrum disorders (ASDs). However, the underlying mechanisms remain poorly understood. Shank2 is a high-confidence ASD candidate gene and localizes primarily to postsynaptic densities (PSDs) of excitatory synapses in the central nervous system (CNS). We show here that loss of Shank2 in mice leads to a lack of social attachment and bonding behavior towards pubs independent of hormonal, cognitive, or sensitive deficits. Shank2<sup>-/-</sup> mice display functional changes in nuclei of the social attachment circuit that were most prominent in the medial preoptic area (MPOA) of the hypothalamus. Selective enhancement of MPOA activity by DREADD technology re-established social bonding behavior in Shank2<sup>-/-</sup> mice, providing evidence that the identified circuit might be crucial for explaining how social deficits in ASD can arise.	SHANK2 mutations have been identified in individuals with neurodevelopmental disorders, including intellectual disability and autism spectrum disorders (ASD). 
Using CRISPR/Cas9 genome editing, we obtained SH-SY5Y cell lines with frameshift mutations on one or both SHANK2 alleles. We investigated the effects of the different SHANK2 mutations on cell morphology, cell proliferation and differentiation potential during early neuronal differentiation. All mutant cell lines showed impaired neuronal differentiation marker expression. Cells with bi-allelic SHANK2 mutations revealed diminished apoptosis and increased proliferation, as well as decreased neurite outgrowth during early neuronal differentiation. Bi-allelic SHANK2 mutations resulted in an increase in p-AKT levels, suggesting that SHANK2 mutations impair downstream signaling of tyrosine kinase receptors. Additionally, cells with bi-allelic SHANK2 mutations had lower amyloid precursor protein (APP) expression compared to controls, suggesting a molecular link between SHANK2 and APP. Together, we can show that frameshift mutations on one or both SHANK2 alleles lead to an alteration of neuronal differentiation in SH-SY5Y cells, characterized by changes in cell growth and pre- and postsynaptic protein expression. We also provide first evidence that downstream signaling of tyrosine kinase receptors and amyloid precursor protein expression are affected.	Autism spectrum disorder (ASD) is a heterogeneous condition with a complex genetic etiology. The objective of this study is to identify the complex genetic factors that underlie the ASD phenotype and other clinical features of Professor Temple Grandin, an animal scientist and woman with high-functioning ASD. Identifying the underlying genetic cause for ASD can impact medical management, personalize services and treatment, and uncover other medical risks that are associated with the genetic diagnosis. Prof. Grandin underwent chromosomal microarray analysis, whole exome sequencing, and whole genome sequencing, as well as a comprehensive clinical and family history intake. 
The raw data were analyzed in order to identify possible genotype-phenotype correlations. Genetic testing identified variants in three genes (SHANK2, ALX1, and RELN) that are candidate risk factors for ASD. We identified variants in MEFV and WNT10A, reported to be disease-associated in previous studies, which are likely to contribute to some of her additional clinical features. Moreover, candidate variants in genes encoding metabolic enzymes and transporters were identified, some of which suggest potential therapies. This case report describes the genomic findings in Prof. Grandin and it serves as an example to discuss state-of-the-art clinical diagnostics for individuals with ASD, as well as the medical, logistical, and economic hurdles that are involved in clinical genetic testing for an individual on the autism spectrum.
+POGZ	autism	33377604	33334860	33277917	33203851	33155545	NA	Efficient genetic manipulation in the developing central nervous system is crucial for investigating mechanisms of neurodevelopmental disorders and the development of promising therapeutics. Common approaches including transgenic mice and in utero electroporation, although powerful in many aspects, have their own limitations. In this study, we delivered vectors based on the AAV9.PHP.eB pseudo-type to the fetal mouse brain, and achieved widespread and extensive transduction of neural cells. When AAV9.PHP.eB-coding gRNA targeting PogZ or Depdc5 was delivered to Cas9 transgenic mice, widespread gene knockout was also achieved at the whole brain level. Our studies provide a useful platform for studying brain development and devising genetic intervention for severe developmental diseases.	White-Sutton syndrome is a rare developmental disorder characterized by global developmental delay, intellectual disabilities (ID), and neurobehavioral abnormalities secondary to pathogenic pogo transposable element-derived protein with zinc finger domain (POGZ) variants. The purpose of our study was to describe the neurocognitive phenotype of an unbiased national cohort of patients with identified POGZ pathogenic variants. This study is based on a French collaboration through the AnDDI-Rares network, and includes 19 patients from 18 families with POGZ pathogenic variants. All clinical data and neuropsychological tests were collected from medical files. Among the 19 patients, 14 patients exhibited ID (six mild, five moderate and three severe). The five remaining patients had learning disabilities and shared a similar neurocognitive profile, including language difficulties, dysexecutive syndrome, attention disorders, slowness, and social difficulties. One patient evaluated for autism was found to have moderate autism spectrum disorder. 
This study reveals that the cognitive phenotype of patients with POGZ pathogenic variants can range from learning disabilities to severe ID. It highlights that pathogenic variations in the same genes can be reported in a large spectrum of neurocognitive profiles, and that children with learning disabilities could benefit from next generation sequencing techniques.	Several genes implicated in autism spectrum disorder (ASD) are chromatin regulators, including POGZ. The cellular and molecular mechanisms leading to ASD impaired social and cognitive behavior are unclear. Animal models are crucial for studying the effects of mutations on brain function and behavior as well as unveiling the underlying mechanisms. Here, we generate a brain specific conditional knockout mouse model deficient for Pogz, an ASD risk gene. We demonstrate that Pogz deficient mice show microcephaly, growth impairment, increased sociability, learning and motor deficits, mimicking several of the human symptoms. At the molecular level, luciferase reporter assay indicates that POGZ is a negative regulator of transcription. In accordance, in Pogz deficient mice we find a significant upregulation of gene expression, most notably in the cerebellum. Gene set enrichment analysis revealed that the transcriptional changes encompass genes and pathways disrupted in ASD, including neurogenesis and synaptic processes, underlying the observed behavioral phenotype in mice. Physiologically, Pogz deficiency is associated with a reduction in the firing frequency of simple and complex spikes and an increase in amplitude of the inhibitory synaptic input in cerebellar Purkinje cells. Our findings support a mechanism linking heterochromatin dysregulation to cerebellar circuit dysfunction and behavioral abnormalities in ASD.	Many genes have been linked to autism. 
However, it remains unclear what long-term changes in neural circuitry result from disruptions in these genes, and how these circuit changes might contribute to abnormal behaviors. To address these questions, we studied behavior and physiology in mice heterozygous for Pogz, a high confidence autism gene. Pogz<sup>+/-</sup> mice exhibit reduced anxiety-related avoidance in the elevated plus maze (EPM). Theta-frequency communication between the ventral hippocampus (vHPC) and medial prefrontal cortex (mPFC) is known to be necessary for normal avoidance in the EPM. We found deficient theta-frequency synchronization between the vHPC and mPFC in vivo. When we examined vHPC-mPFC communication at higher resolution, vHPC input onto prefrontal GABAergic interneurons was specifically disrupted, whereas input onto pyramidal neurons remained intact. These findings illustrate how the loss of a high confidence autism gene can impair long-range communication by causing inhibitory circuit dysfunction within pathways important for specific behaviors.
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/pmids_to_pubtator_matrix_output	Wed Mar 24 08:33:56 2021 +0000
@@ -0,0 +1,7 @@
+5-httlpr	adnp	akt	alx1	amyloid precursor protein	ankk1	ankrd11	ankyrin repeat domain-containing protein 11	apoe	arhgap21	arid1a	arid1b	asah1	atrx	bdnf	brain-derived neurotrophic factor	c-fos	cacna1a	cacna1h	camkii	ccne1	cdh1	chd2	chd8	clcn2	cldn18	comt	creb	ctnnb1	ddx3x	depdc5	drd2	drd4	dyrk1a	egfr	fbxw7	fgfr1	flt4	foxr1	glun2a	gonadotropin-releasing hormone	grin2	grin2a	growth hormone	il6r	itch	kcnq2	leb	mapt	mdm2	mecp2	med12	mefv	mek	nav1.5	nav1.7	nestin	nf1	nlrp5	notch1	ns5	ntrk2	p21	p75	pard3	pik3ca	pkhd1	pogz	pomc	ppp2r1a	pten	reln	runx1	scn10a	scn1a	scn2a	scn8a	scn9a	shank2	slc2a1	slc6a4	sox10	sox2	syngap1	tbx1	tcf7l2	tjp1	tnni3k	tp53	trkb	trpa1	trpv1	tsc1	tsc2	whsc1l1	wnt10a	ythdf2
+0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	1	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	1	1	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	1	1	0	0	0
+0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	1	0	0	0	0	0	0	0	0	0	0	1	0	1	0	0	0	0	0	0	1	1	1	1	1	0	0	0	0	1	0	1	0	0	0	0	1	0	1	1	1	1	0	1	0	0	0	1	1	0	0	0	0	0	1	0	1	1	0	1	0	0	0
+1	0	0	0	0	1	0	0	1	0	1	0	0	0	1	1	1	0	0	1	0	1	0	0	0	1	1	1	0	0	0	1	1	0	0	0	0	1	0	1	0	1	1	0	1	0	0	0	1	0	0	1	0	1	0	0	0	0	0	1	0	1	0	0	0	0	1	0	1	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	1	1	0	1	1	1	0	0	0	0	0	0	0
+0	1	0	0	0	0	1	1	0	0	0	1	0	0	0	0	0	0	0	0	1	0	1	1	0	0	0	0	1	1	0	0	0	1	0	1	1	0	1	0	1	0	0	1	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	1	0	0	0	0	0	1	0	0	0	0	0	0	0	1	0	0	0	0	1	0	0	0	0	0	1	0	1
+0	0	1	1	1	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	1	0	0	0	1	0	0	0	0	0	0	1	0	0	0	0	0	0	1	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	1	0
+0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/pmids_to_pubtator_matrix_output_byid	Wed Mar 24 08:33:56 2021 +0000
@@ -0,0 +1,7 @@
+ADNP	AKT	ALX1	ANKRD11	APOE	ARID1A	ARID1B	ATRX	BDNF	CACNA1A	CCNE1	CDH1	CHD2	CHD8	CLDN18	COMT	CTNNB1	DDX3X	DEPDC5	DRD2	DYRK1A	Depdc5	EGFR	FGFR1	FLT4	FOXR1	GRIN2	GRIN2A	GluN2A	IL6R	LEB	MAPT	MDM2	MED12	MEFV	NF1	NLRP5	NOTCH1	Nav1.5	Nav1.7	Nestin	PIK3CA	PKHD1	POGZ	PPP2R1A	PTEN	Pogz	RELN	SCN1A	SCN2A	SCN9A	SHANK2	SLC6A4	SYNGAP1	Shank2	Sox10	Sox2	TP53	TSC2	TrkB	WHSC1L1	WNT10A	YTHDF2	amyloid precursor protein	c-Fos	gonadotropin-releasing hormone	growth hormone	itch	p21	p75
+0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
+0	0	0	0	0	0	0	1	0	1	0	0	0	0	0	0	0	0	1	0	0	0	1	0	0	0	0	0	0	0	1	0	0	0	0	1	1	0	1	1	1	1	0	0	0	1	0	0	1	1	1	0	0	0	0	1	1	1	1	0	0	0	0	0	0	0	0	1	0	1
+0	0	0	0	1	1	0	0	1	0	0	1	0	0	1	1	0	0	0	1	0	0	0	0	1	0	1	1	1	1	0	1	0	1	0	0	0	1	0	0	0	0	1	0	0	0	0	0	0	0	0	0	1	0	0	0	0	1	0	1	0	0	0	0	1	0	0	0	0	0
+1	0	0	1	0	0	1	0	0	0	1	0	1	1	0	0	1	1	0	0	1	0	0	1	0	1	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	1	0	0	0	1	0	0	0	1	0	0	1	0	1	0	0	1	1	0	1	0
+0	1	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	1	0	0	1	0	0	0	0	0	0	1	0	1	0	0	0	0	0	0
+0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/pmids_to_pubtator_matrix_output_number	Wed Mar 24 08:33:56 2021 +0000
@@ -0,0 +1,7 @@
+amyloid precursor protein	ankrd11	anxiety	asah1	asd	autism	bdnf	cldn18	dravet syndrome	embryonic kidney	epilepsy	gastric srcc	itch	kbg syndrome	learning disabilities	memory impairment	nav1.7	ns5	p21	pain	pogz	scn1a	scn2a	scn9a	shank2	short stature	tumors	white-sutton syndrome
+0	0	0	1	0	0	0	0	1	0	1	0	0	0	0	0	0	0	0	0	0	1	1	0	0	0	0	0
+0	0	0	0	0	0	0	0	0	1	0	0	1	0	0	0	1	0	0	1	0	0	0	1	0	0	0	0
+0	0	1	0	0	0	1	1	0	0	0	1	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0
+0	1	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	1	0	0	0	0	0	0	1	1	0
+1	0	0	0	1	1	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	1	0	0	0
+0	0	1	0	0	1	0	0	0	0	0	0	0	0	1	0	0	0	0	0	1	0	0	0	0	0	0	1
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/pubmed_by_queries_output	Wed Mar 24 08:33:56 2021 +0000
@@ -0,0 +1,7 @@
+ID_gene	GROUPING_disease	PMID_1	PMID_2	PMID_3	PMID_4	PMID_5
+SCN1A	epilepsy	33565071	33531663	33528079	33519675	33478845
+SCN9A	epilepsy	33389681	33370834	33278787	33237934	33232657
+GRIN2A	epilepsy	33531473	33499151	33457012	33420383	33370585
+ANKRD11	autism	33527450	33476899	33354850	33262785	33179249
+SHANK2	autism	33547379	33515293	33491217	33483523	33383702
+POGZ	autism	33377604	33334860	33277917	33203851	33155545
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/pubmed_by_queries_output_abstracts	Wed Mar 24 08:33:56 2021 +0000
@@ -0,0 +1,7 @@
+ID_gene	GROUPING_disease	ABSTRACT_1	ABSTRACT_2	ABSTRACT_3	ABSTRACT_4	ABSTRACT_5
+SCN1A	epilepsy	[Analysis of SCN1A gene variants among patients with Dravet syndrome]. To analyze the clinical features and genetic variants in two patients with Dravet syndrome (DS). Peripheral blood samples of the children and their parents were collected for the extraction of genomic DNA and high-throughput sequencing. Suspected variants were confirmed by Sanger sequencing. By high-throughput sequencing, the two children were found to respectively harbor a c.2135delC frameshifting variant in exon 12 and a c.1522G&gt;T nonsense variant in exon 10 of the SCN1A gene. Both variants were predicted to be pathogenic by bioinformatic analysis. Based on the American College of Medical Genetics and Genomics standards and guidelines, the c.2135delC and c.1522G&gt;A variants of the SCN1A gene were predicted to be pathogenic (PVS1+ PS2+ PM2+ PP3). The variants of the SCN1A gene probably underlay the DS in the patients. Above finding has enriched the variant spectrum and enabled genetic counseling for their families.	Sodium channelopathies in neurodevelopmental disorders. The voltage-gated sodium channel α-subunit genes comprise a highly conserved gene family. Mutations of three of these genes, SCN1A, SCN2A and SCN8A, are responsible for a significant burden of neurological disease. Recent progress in identification and functional characterization of patient variants is generating new insights and novel approaches to therapy for these devastating disorders. Here we review the basic elements of sodium channel function that are used to characterize patient variants. We summarize a large body of work using global and conditional mouse mutants to characterize the in vivo roles of these channels. We provide an overview of the neurological disorders associated with mutations of the human genes and examples of the effects of patient mutations on channel function. 
Finally, we highlight therapeutic interventions that are emerging from new insights into mechanisms of sodium channelopathies.	Customized Targeted Massively Parallel Sequencing Enables More Precisely Diagnosis of Patients with Epilepsy. Advancement in genetic technology has led to the identification of an increasing number of genes in epilepsy. This will provide a huge information in clinical practice and improve diagnosis and treatment of epilepsy. this was a single-center retrospective cohort study of 80 patients who underwent NGS testing with customize epilepsy panel. In total 54 out of 80 patients (67, 5%), pathogenic / likely pathogenic and variants of uncertain significance variants were identified according to ACMG criteria. Pathogenic or likely pathogenic variants (n=35) were identified in 29 out of 80 individuals (36.25%). Variants of uncertain significance (VOUS) (n=34) have identified in 28 out of 80 patients (35%). Pathogenic, likely pathogenic, and variants of uncertain significance (VOUS) were most frequently identified in TSC2 (n = 11), SCN1A (n = 6) and TSC1 (n = 5) genes. Other common genes were KCNQ2 (n = 3), AMT (n = 3), CACNA1H (n = 3), CLCN2 (n = 3), MECP2 (n = 2), ASAH1 (n = 2) and SLC2A1 (n = 2). NGS based testing panels contributes the diagnosis of epilepsy and may change the clinical management by preventing unnecessary and potentially harmful diagnostic procedures and management in patients. Thus, our results highlighted the benefit of genetic testing in children suffered with epilepsy. This article is protected by copyright. All rights reserved.	Association Between SCN1A rs2298771, SCN1A rs10188577, SCN2A rs17183814, and SCN2A rs2304016 Polymorphisms and Responsiveness to Antiepileptic Drugs: A Meta-Analysis. Background:SCN1A and SCN2A genes have been reported to be associated with the efficacy of single and combined antiepileptic therapy, but the results remain contradictory. 
Previous meta-analyses on this topic mainly focused on the SCN1A rs3812718 polymorphism. However, meta-analyses focused on SCN1A rs2298771, SCN1A rs10188577, SCN2A rs17183814, or SCN2A rs2304016 polymorphisms are scarce or non-existent. Objective: We aimed to conduct a meta-analysis to determine the effects of SCN1A rs2298771, SCN1A rs10188577, SCN2A rs17183814, and SCN2A rs2304016 polymorphisms on resistance to antiepileptic drugs (AEDs). Methods: We searched the PubMed, Embase, Cochrane Library, WANFANG, and CNKI databases up to June 2020 to collect studies on the association of SCN1A and SCN2A polymorphisms with reactivity to AEDs. We calculated the pooled odds ratios (ORs) under the allelic, homozygous, heterozygous, dominant, and recessive genetic models to identify the association between the four single-nucleotide polymorphisms (SNPs) and resistance to AEDs. Results: Our meta-analysis included 19 eligible studies. The results showed that the SCN1A rs2298771 polymorphism was related to AED resistance in the allelic, homozygous, and recessive genetic models (G vs. A: OR = 1.20, 95% CI: 1.012-1.424; GG vs. AA: OR = 1.567, 95% CI: 1.147-2.142; GG vs. AA + AG: OR = 1.408, 95% CI: 1.053-1.882). The homozygous model remained significant after Bonferroni correction (P &lt; 0.0125). Further subgroup analyses demonstrated the significance of the correlation in the dominant model in Caucasians (South Asians) after Bonferroni correction (GG + GA vs. AA: OR = 1.620, 95% CI: 1.165-2.252). However, no association between SCN1A rs2298771 polymorphism and resistance to AEDs was found in Asians or Caucasians (non-South Asians). For SCN1A rs10188577, SCN2A rs17183814, and SCN2A rs2304016 polymorphisms, the correlations with responsiveness to AEDs were not significant in the overall population nor in any subgroup after conducting the Bonferroni correction. 
The results for SCN1A rs2298771, SCN1A rs10188577, and SCN2A rs2304016 polymorphisms were stable and reliable according to sensitivity analysis and Begg and Egger tests. However, the results for SCN2A rs17183814 polymorphism have to be treated cautiously owing to the significant publication bias revealed by Begg and Egger tests. Conclusions: The present meta-analysis indicated that SCN1A rs2298771 polymorphism significantly affects resistance to AEDs in the overall population and Caucasians (South Asians). There were no significant correlations between SCN1A rs10188577, SCN2A rs17183814, and SCN2A rs2304016 polymorphisms and resistance to AEDs.	Multicenter prospective longitudinal study in 34 patients with Dravet syndrome: Neuropsychological development in the first six years of life. The objective of this study was to identify developmental trajectories of developmental/behavioral phenotypes and possibly their relationship to epilepsy and genotype by analyzing developmental and behavioral features collected prospectively and longitudinally in a cohort of patients with Dravet syndrome (DS). Thirty-four patients from seven Italian tertiary pediatric neurology centers were enrolled in the study. All patients were examined for the SCN1A gene mutation and prospectively assessed from the first years of life with repeated full clinical observations including neurological and developmental examinations. Subjects were found to follow three neurodevelopmental trajectories. In the first group (16 patients), an initial and usually mild decline was observed between the second and the third year of life, specifically concerning visuomotor abilities, later progressing towards global involvement of all abilities. The second group (12 patients) showed an earlier onset of global developmental impairment, progressing towards a generally worse outcome. The third group of only two patients ended up with a normal neurodevelopmental quotient, but with behavioral and linguistic problems. 
The remaining four patients were not classifiable due to a lack of critical assessments just before developmental decline. The neurodevelopmental trajectories described in this study suggest a differential contribution of neurobiological and genetic factors. The profile of the first group, which included the largest fraction of patients, suggests that in the initial phase of the disease, visuomotor defects might play a major role in determining developmental decline. Early diagnosis of milder cases with initial visuomotor impairment may therefore provide new tools for a more accurate habilitation strategy.
+SCN9A	epilepsy	Satellite Glial Cells Give Rise to Nociceptive Sensory Neurons. Dorsal root ganglia (DRG) sensory neurons can transmit information about noxious stimulus to cerebral cortex via spinal cord, and play an important role in the pain pathway. Alterations of the pain pathway lead to CIPA (congenital insensitivity to pain with anhidrosis) or chronic pain. Accumulating evidence demonstrates that nerve damage leads to the regeneration of neurons in DRG, which may contribute to pain modulation in feedback. Therefore, exploring the regeneration process of DRG neurons would provide a new understanding to the persistent pathological stimulation and contribute to reshape the somatosensory function. It has been reported that a subpopulation of satellite glial cells (SGCs) express Nestin and p75, and could differentiate into glial cells and neurons, suggesting that SGCs may have differentiation plasticity. Our results in the present study show that DRG-derived SGCs (DRG-SGCs) highly express neural crest cell markers Nestin, Sox2, Sox10, and p75, and differentiate into nociceptive sensory neurons in the presence of histone deacetylase inhibitor VPA, Wnt pathway activator CHIR99021, Notch pathway inhibitor RO4929097, and FGF pathway inhibitor SU5402. The nociceptive sensory neurons express multiple functionally-related genes (SCN9A, SCN10A, SP, Trpv1, and TrpA1) and are able to generate action potentials and voltage-gated Na<sup>+</sup> currents. Moreover, we found that these cells exhibited rapid calcium transients in response to capsaicin through binding to the Trpv1 vanilloid receptor, confirming that the DRG-SGC-derived cells are nociceptive sensory neurons. Further, we show that Wnt signaling promotes the differentiation of DRG-SGCs into nociceptive sensory neurons by regulating the expression of specific transcription factor Runx1, while Notch and FGF signaling pathways are involved in the expression of SCN9A. 
These results demonstrate that DRG-SGCs have stem cell characteristics and can efficiently differentiate into functional nociceptive sensory neurons, shedding light on the clinical treatment of sensory neuron-related diseases.	Computer-aided Discovery of a New Nav1.7 Inhibitor for Treatment of Pain and Itch. Voltage-gated sodium channel Nav1.7 has been validated as a perspective target for selective inhibitors with analgesic and anti-itch activity. The objective of this study was to discover new candidate compounds with Nav1.7 inhibitor properties. The authors hypothesized that their approach would yield at least one new compound that inhibits sodium currents in vitro and exerts analgesic and anti-itch effects in mice. In silico structure-based similarity search of 1.5 million compounds followed by docking to the Nav1.7 voltage sensor of Domain 4 and molecular dynamics simulation was performed. Patch clamp experiments in Nav1.7-expressing human embryonic kidney 293 cells and in mouse and human dorsal root ganglion neurons were conducted to test sodium current inhibition. Formalin-induced inflammatory pain model, paclitaxel-induced neuropathic pain model, histamine-induced itch model, and mouse lymphoma model of chronic itch were used to confirm in vivo activity of the selected compound. After in silico screening, nine compounds were selected for experimental assessment in vitro. Of those, four compounds inhibited sodium currents in Nav1.7-expressing human embryonic kidney 293 cells by 29% or greater (P &lt; 0.05). Compound 9 (3-(1-benzyl-1H-indol-3-yl)-3-(3-phenoxyphenyl)-N-(2-(pyrrolidin-1-yl)ethyl)propanamide, referred to as DA-0218) reduced sodium current by 80% with a 50% inhibition concentration of 0.74 μM (95% CI, 0.35 to 1.56 μM), but had no effects on Nav1.5-expressing human embryonic kidney 293 cells. In mouse and human dorsal root ganglion neurons, DA-0218 reduced sodium currents by 17% (95% CI, 6 to 28%) and 22% (95% CI, 9 to 35%), respectively. 
The inhibition was greatly potentiated in paclitaxel-treated mouse neurons. Intraperitoneal and intrathecal administration of the compound reduced formalin-induced phase II inflammatory pain behavior in mice by 76% (95% CI, 48 to 100%) and 80% (95% CI, 68 to 92%), respectively. Intrathecal administration of DA-0218 produced acute reduction in paclitaxel-induced mechanical allodynia, and inhibited histamine-induced acute itch and lymphoma-induced chronic itch. This study's computer-aided drug discovery approach yielded a new Nav1.7 inhibitor that shows analgesic and anti-pruritic activity in mouse models.	High genetic burden in 163 Chinese children with status epilepticus. This study aimed to investigate the genetic aetiology in Chinese children diagnosed with status epilepticus (SE). Next-generation sequencing, copy number variation (CNV) analysis, and other genetic testing methods were conducted for children with SE lacking an identifiable non-genetic aetiology. Furthermore, the phenotype and molecular data of patients with SE were retrospectively analysed. Among children with SE lacking an identifiable non-genetic aetiology, 73 out of 163 children (44.8 %) were found to have causative variants associated with SE including 66 monogenic mutations in 22 genes and 7 CNVs. Based on the American College of Medical Genetics and Genomics scoring system, the monogenic variants included 64 pathogenic/likely pathogenic and 2 uncertain significance variants. SCN1A gene mutations (n = 32) were the most common cause, followed by TSC2 (n = 5), CACNA1A (n = 5), SCN2A (n = 4), SCN9A (n = 2) and DEPDC5 (n = 2) gene mutations. Sixteen mutations were identified in single genes. Furthermore, 51 (77.3 %) monogenic mutations were de novo. 
Age at SE onset &lt; 1 year (odds ratio [OR] = 2.70, 95 % confidence interval [CI]: 1.25-5.83, p = 0.012) and co-morbidity of intellectual disability (OR = 3.36, 95 %CI: 1.61-6.99, p = 0.001) were independently associated with pathogenic genetic variants. This study identified genetic aetiology in 44.8 % of patients with SE, which indicates a high burden of genetic aetiology among children with SE in China. Our findings highlight the importance for genetic testing of children with SE that lacks an identifiable non-genetic aetiology.	Correlation of genetic alterations by whole-exome sequencing with clinical outcomes of glioblastoma patients from the Lebanese population. Glioblastoma (GBM) is an aggressive brain tumor associated with high degree of resistance to treatment. Given its heterogeneity, it is important to understand the molecular landscape of this tumor for the development of more effective therapies. Because of the different genetic profiles of patients with GBM, we sought to identify genetic variants in Lebanese patients with GBM (LEB-GBM) and compare our findings to those in the Cancer Genome Atlas (TCGA). We performed whole exome sequencing (WES) to identify somatic variants in a cohort of 60 patient-derived GBM samples. We focused our analysis on 50 commonly mutated GBM candidate genes and compared mutation signatures between our population and publicly available GBM data from TCGA. We also cross-tabulated biological covariates to assess for associations with overall survival, time to recurrence and follow-up duration. We included 60 patient-derived GBM samples from 37 males and 23 females, with age ranging from 3 to 80 years (mean and median age at diagnosis were 51 and 56, respectively). Recurrent tumor formation was present in 94.8% of patients (n = 55/58). After filtering, we identified 360 somatic variants from 60 GBM patient samples. 
Most frequently mutated genes in our samples included ATRX, PCDHX11, PTEN, TP53, NF1, EGFR, PIK3CA, and SCN9A. Mutations in NLRP5 were associated with decreased overall survival among the Lebanese GBM cohort (p = 0.002). EGFR and NF1 mutations were associated with the frontal lobe and temporal lobe in our LEB-GBM cohort, respectively. Our WES analysis confirmed the similarity in mutation signature of the LEB-GBM population with TCGA cohorts. It showed that 1 out of the 50 commonly GBM candidate gene mutations is associated with decreased overall survival among the Lebanese cohort. This study also highlights the need for studies with larger sample sizes to inform clinicians for better prognostication and management of Lebanese patients with GBM.	Structural Basis for High-Affinity Trapping of the NaV1.7 Channel in Its Resting State by Tarantula Toxin. Voltage-gated sodium channels initiate electrical signals and are frequently targeted by deadly gating-modifier neurotoxins, including tarantula toxins, which trap the voltage sensor in its resting state. The structural basis for tarantula-toxin action remains elusive because of the difficulty of capturing the functionally relevant form of the toxin-channel complex. Here, we engineered the model sodium channel NaVAb with voltage-shifting mutations and the toxin-binding site of human NaV1.7, an attractive pain target. This mutant chimera enabled us to determine the cryoelectron microscopy (cryo-EM) structure of the channel functionally arrested by tarantula toxin. Our structure reveals a high-affinity resting-state-specific toxin-channel interaction between a key lysine residue that serves as a &quot;stinger&quot; and penetrates a triad of carboxyl groups in the S3-S4 linker of the voltage sensor. 
By unveiling this high-affinity binding mode, our studies establish a high-resolution channel-docking and resting-state locking mechanism for huwentoxin-IV and provide guidance for developing future resting-state-targeted analgesic drugs.
+GRIN2A	epilepsy	Chronic D-ribose and D-mannose overload induce depressive/anxiety-like behavior and spatial memory impairment in mice. The effects of different forms of monosaccharides on the brain remain unclear, though neuropsychiatric disorders undergo changes in glucose metabolism. This study assessed cell viability responses to five commonly consumed monosaccharides-D-ribose (RIB), D-glucose, D-mannose (MAN), D-xylose and L-arabinose-in cultured neuro-2a cells. Markedly decreased cell viability was observed in cells treated with RIB and MAN. We then showed that high-dose administration of RIB induced depressive- and anxiety-like behavior as well as spatial memory impairment in mice, while high-dose administration of MAN induced anxiety-like behavior and spatial memory impairment only. Moreover, significant pathological changes were observed in the hippocampus of high-dose RIB-treated mice by hematoxylin-eosin staining. Association analysis of the metabolome and transcriptome suggested that the anxiety-like behavior and spatial memory impairment induced by RIB and MAN may be attributed to the changes in four metabolites and 81 genes in the hippocampus, which is involved in amino acid metabolism and serotonin transport. In addition, combined with previous genome-wide association studies on depression, a correlation was found between the levels of Tnni3k and Tbx1 in the hippocampus and RIB induced depressive-like behavior. Finally, metabolite-gene network, qRT-PCR and western blot analysis showed that the insulin-POMC-MEK-TCF7L2 and MAPK-CREB-GRIN2A-CaMKII signaling pathways were respectively associated with RIB and MAN induced depressive/anxiety-like behavior and spatial memory impairment. Our findings clarified our understanding of the biological mechanisms underlying RIB and MAN induced depressive/anxiety-like behavior and spatial memory impairment in mice and highlighted the deleterious effects of high-dose RIB and MAN as long-term energy sources.	
Genetic Factors That Could Affect Concussion Risk in Elite Rugby. Elite rugby league and union have some of the highest reported rates of concussion (mild traumatic brain injury) in professional sport due in part to their full-contact high-velocity collision-based nature. Currently, concussions are the most commonly reported match injury during the tackle for both the ball carrier and the tackler (8-28 concussions per 1000 player match hours) and reports exist of reduced cognitive function and long-term health consequences that can end a playing career and produce continued ill health. Concussion is a complex phenotype, influenced by environmental factors and an individual's genetic predisposition. This article reviews concussion incidence within elite rugby and addresses the biomechanics and pathophysiology of concussion and how genetic predisposition may influence incidence, severity and outcome. Associations have been reported between a variety of genetic variants and traumatic brain injury. However, little effort has been devoted to the study of genetic associations with concussion within elite rugby players. Due to a growing understanding of the molecular characteristics underpinning the pathophysiology of concussion, investigating genetic variation within elite rugby is a viable and worthy proposition. Therefore, we propose from this review that several genetic variants within or near candidate genes of interest, namely APOE, MAPT, IL6R, COMT, SLC6A4, 5-HTTLPR, DRD2, DRD4, ANKK1, BDNF and GRIN2A, warrant further study within elite rugby and other sports involving high-velocity collisions.	Highly expressed Claudin18.2 as a potential therapeutic target in advanced gastric signet-ring cell carcinoma (SRCC). Advanced gastric signet-ring cell carcinoma (SRCC) is a specific type of malignant gastric cancer (GC) with distinct poorer survival. Claudin18.2 (CLDN18.2) is a promising neo-biomarker for the treatment of GC. 
Clinical trials of CLDN18.2-targeted antibody and T cell-based immunotherapy provide promising prospects for the treatment of GC. The effect of antibody therapy depended on the expression rate of CLDN18.2 has been found in clinical trials. This study aimed to determine the prevalence and the therapeutic value of CLDN18.2 in advanced gastric SRCC. Expression of CLDN18.2 in 105 formalin-fixed, paraffin-embedded (FFPE) tumor tissues was detected by immunohistochemistry (IHC) and evaluated according to FAST criteria. Next-generation sequencing (NGS) using 416 pan-cancer genes panel was performed to characterize the genomic landscape in 61 advanced gastric SRCC patients. Fisher's exact test was used to determine gene differences in different CLDN18.2 expression levels. A total number of 105 advanced gastric SRCC samples were analyzed, of which 95.2% (100/105) were positive stained. Moderate-to-strong CLDN18.2 expression was observed in 64.8% (68/105) of all samples. In particular, 21.0% (22/105) samples had positive staining in more than 90% tumor cells. No significance was found between CLDN18.2 expression and overall survival (OS). NGS results showed that single nucleotide variations (SNVs) could be frequently found in TP53 (26.2%), CDH1 (19.7%), MED12 (18.0%), PKHD1 (18.0%) and ARID1A (11.5%), besides, copy number variations (CNVs) were rich in NOTCH1 (18.0%) and FLT4 (9.8%) in SRCC samples. Moreover, SNVs in GRIN2A was found in 20% of the patients who had CLDN18.2 staining in &lt;40% of tumor cells (P=0.043), indicating CLDN18.2 expression might be related to the aberration of GRIN2A in advanced gastric SRCC. The highly expressed CLDN18.2 among advanced gastric SRCC patients that we found certified the value of CLDN18.2-targeted therapy in this specific type of GC. In addition, analyses between CLDN18.2 expression and genetic abnormalities provided novel therapeutic options for advanced gastric SRCC.	
Voltage-independent GluN2A-type NMDA receptor Ca<sup>2+</sup> signaling promotes audiogenic seizures, attentional and cognitive deficits in mice. The NMDA receptor-mediated Ca<sup>2+</sup> signaling during simultaneous pre- and postsynaptic activity is critically involved in synaptic plasticity and thus has a key role in the nervous system. In GRIN2-variant patients alterations of this coincidence detection provoked complex clinical phenotypes, ranging from reduced muscle strength to epileptic seizures and intellectual disability. By using our gene-targeted mouse line (Grin2a<sup>N615S</sup>), we show that voltage-independent glutamate-gated signaling of GluN2A-containing NMDA receptors is associated with NMDAR-dependent audiogenic seizures due to hyperexcitable midbrain circuits. In contrast, the NMDAR antagonist MK-801-induced c-Fos expression is reduced in the hippocampus. Likewise, the synchronization of theta- and gamma oscillatory activity is lowered during exploration, demonstrating reduced hippocampal activity. This is associated with exploratory hyperactivity and aberrantly increased and dysregulated levels of attention that can interfere with associative learning, in particular when relevant cues and reward outcomes are disconnected in space and time. Together, our findings provide (i) experimental evidence that the inherent voltage-dependent Ca<sup>2+</sup> signaling of NMDA receptors is essential for maintaining appropriate responses to sensory stimuli and (ii) a mechanistic explanation for the neurological manifestations seen in the NMDAR-related human disorders with GRIN2 variant-mediated intellectual disability and focal epilepsy.	Treatment response to low-dose ketamine infusion for treatment-resistant depression: A gene-based genome-wide association study. 
Evidence suggested the crucial roles of brain-derived neurotrophic factor (BDNF) and glutamate system functioning in the antidepressant mechanisms of low-dose ketamine infusion in treatment-resistant depression (TRD). 65 patients with TRD were genotyped for 684,616 single nucleotide polymorphisms (SNPs). Twelve ketamine-related genes were selected for the gene-based genome-wide association study on the antidepressant effect of ketamine infusion and the resulting serum ketamine and norketamine levels. Specific SNPs and whole genes involved in BDNF-TrkB signaling (i.e., rs2049048 in BDNF and rs10217777 in NTRK2) and the glutamatergic and GABAergic systems (i.e., rs16966731 in GRIN2A) were associated with the rapid (within 240 min) and persistent (up to 2 weeks) antidepressant effect of low-dose ketamine infusion and with serum ketamine and norketamine levels. Our findings confirmed the predictive roles of BDNF-TrkB signaling and glutamatergic and GABAergic systems in the underlying mechanisms of low-dose ketamine infusion for TRD treatment.
+ANKRD11	autism	Genetic characterization of adult primary pleomorphic uterine rhabdomyosarcoma and comparison with uterine carcinosarcoma. To characterize the genetic alterations in adult primary uterine rhabdomyosarcomas (uRMSs) and to investigate whether these tumors are genetically distinct from uterine carcinosarcomas (UCSs). Three tumors originally diagnosed as primary adult pleomorphic uRMS were subjected to massively parallel sequencing targeting 468 cancer-related genes and RNA-sequencing. Mutational profiles were compared to those from UCSs (n=57) obtained from The Cancer Genome Atlas. Sequencing data analyses were performed using validated bioinformatic approaches. Pathogenic TP53 mutations and high levels of genomic instability were detected in the three cases. uRMS1 harbored a likely pathogenic YTHDF2-FOXR1 fusion gene. uRMS2 displayed a PPP2R1A hotspot mutation and amplification of multiple genes, including WHSC1L1, FGFR1, MDM2 and CCNE1, whereas uRMS3 harbored an FBXW7 hotspot mutation and an ANKRD11 homozygous deletion. Hierarchical clustering of somatic mutations and copy number alterations revealed that these tumors initially diagnosed as pleomorphic uRMSs and UCSs were similar. Subsequent comprehensive pathologic re-review of the three uRMSs revealed previously un-identified minute pan-cytokeratin-positive atypical glands in one case (uRMS3), favoring its reclassification as UCS with extensive rhabdomyosarcomatous overgrowth. Adult pleomorphic uRMSs harbor TP53 mutations and high levels of copy number alterations. Our findings underscore the challenge in discriminating between uRMS and UCS with rhabdomyosarcomatous differentiation.	Electroclinical features and outcome of ANKRD11-related KBG syndrome: A novel report and literature review. NA	Two loss-of-function ANKRD11 variants in Chinese patients with short stature and a possible molecular pathway. 
KBG syndrome is a rare genetic disease characterized mainly by skeletal abnormalities, distinctive facial features, and intellectual disability. Heterozygous mutations in ANKRD11 gene, or deletion of 16q24.3 that includes ANKRD11 gene are the cause of KBG syndrome. We describe two patients presenting with short stature and partial facial features, whereas no intellectual disability or hearing loss was observed in them. Two ANKRD11 variants, c.4039_4041del (p. Lys1347del) and c.6427C &gt; G (p. Leu2143Val), were identified in this study. Both of them were classified as variants of uncertain significance (VOUS) by ACMG/AMP guidelines and were inherited from their mothers. ANKRD11 could enhance the transactivation of p21 gene, which was identified to participate in chondrogenic differentiation. In this study, we demonstrated that the knockdown of ANKRD11 could reduce the p21-promoter luciferase activities while re-introduction of wild type ANKRD11, but not ANKRD11 variants (p. Lys1347del or p. Leu2143Val), could restore the p21 levels. Thus, our study report two loss-of-function ANKRD11 variants which might provide new insight on pathogenic mechanism that correlates ANKRD11 variants with the short stature phenotype of KBG syndrome.	Two Novel Mutations of ANKRD11 Gene and Wide Clinical Spectrum in KBG Syndrome: Case Reports and Literature Review. KBG syndrome (OMIM #148050) is a rare, autosomal dominant inherited genetic disorder caused by heterozygous mutations in the ankyrin repeat domain-containing protein 11 (ANKRD11) gene or by microdeletion of chromosome 16q24.3. It is characterized by macrodontia of the upper central incisors, distinctive facial dysmorphism, short stature, vertebral abnormalities, hand anomaly including clinodactyly, and various degrees of developmental delay. KBG syndrome presents with variable clinical feature and severity among individuals. 
Here, we report two KBG patients who have different novel heterozygous mutations of ANKRD11 gene with wide range of clinical manifestations. Two novel heterozygous mutations of ANKRD11 gene were identified in two unrelated Korean patients with variable clinical presentations. The first patient presented with short stature and early puberty and was treated with growth hormone and gonadotropin-releasing hormone agonist without adverse effects. He had mild intellectual disability. In targeted exome sequencing, a novel de novo frameshift variant was identified in ANKRD11, c.5889del, and p. (Ile1963MetfsX9). The second patient had severe intellectual disability with epilepsy. He had normal height and prepubertal stage at the age of 11 years. He had behavioral problems such as autism-like features, anxiety, and stereotypical movements. Whole exome sequencing (WES) was performed, and the novel heterozygous mutation, c3310dup, p. (Glu110GlyfsTer5) in ANKRD11 was identified. KBG syndrome is often underdiagnosed because of its non-specific features and phenotypic variability. Performing a next-generation sequencing panel, including the ANKRD11 gene for cases of developmental delay with/without short stature may be helpful to identify hitherto undiagnosed KBG syndrome patients.	Description of neurodevelopmental phenotypes associated with 10 genetic neurodevelopmental disorders: A scoping review. Neurodevelopmental disorders (NDDs) are a heterogeneous group of conditions including intellectual disability, global developmental delay, autism spectrum disorder, and attention deficit hyperactivity disorder. Advances in genetic diagnostic technology have led to the identification of a number of NDD-associated genes, but reports of cognitive and developmental outcomes in affected individuals have been variable. 
The objective of this scoping review is to synthesize available information pertaining to the developmental outcomes of individuals with pathogenic variants in ten emerging recurrent NDD-associated genes identified from large scale sequencing studies; ADNP, ANKRD11, ARID1B, CHD2, CHD8, CTNNB1, DDX3X, DYRK1A, SCN2A, and SYNGAP1. After a comprehensive search, 260 articles were selected that reported on neurodevelopmental measures or diagnoses. We identify the spectrum of developmental outcomes for each genetic NDD, including prevalence of intellectual disability, frequency of co-morbid NDDs such as ADHD and autism, and commonly reported medical issues that can help inform diagnosis and treatment. There are significant gaps in our understanding of the natural history of these conditions. Future research focusing on barriers to assessment, the development of modified assessment tools appropriate for long-term outcomes in genetic NDD, and collection of longitudinal data will increase understanding of prognosis in these conditions and inform evaluations of treatment.
+SHANK2	autism	Role of PDZ-binding motif from West Nile virus NS5 protein on viral replication. West Nile virus (WNV) is a Flavivirus, which can cause febrile illness in humans that may progress to encephalitis. Like any other obligate intracellular pathogens, Flaviviruses hijack cellular protein functions as a strategy for sustaining their life cycle. Many cellular proteins display globular domain known as PDZ domain that interacts with PDZ-Binding Motifs (PBM) identified in many viral proteins. Thus, cellular PDZ-containing proteins are common targets during viral infection. The non-structural protein 5 (NS5) from WNV provides both RNA cap methyltransferase and RNA polymerase activities and is involved in viral replication but its interactions with host proteins remain poorly known. In this study, we demonstrate that the C-terminal PBM of WNV NS5 recognizes several human PDZ-containing proteins using both in vitro and in cellulo high-throughput methods. Furthermore, we constructed and assayed in cell culture WNV replicons where the PBM within NS5 was mutated. Our results demonstrate that the PBM of WNV NS5 is important in WNV replication. Moreover, we show that knockdown of the PDZ-containing proteins TJP1, PARD3, ARHGAP21 or SHANK2 results in the decrease of WNV replication in cells. Altogether, our data reveal that interactions between the PBM of NS5 and PDZ-containing proteins affect West Nile virus replication.	Genetic influences of autism candidate genes on circuit wiring and olfactory decoding. Olfaction supports a multitude of behaviors vital for social communication and interactions between conspecifics. Intact sensory processing is contingent upon proper circuit wiring. Disturbances in genetic factors controlling circuit assembly and synaptic wiring can lead to neurodevelopmental disorders, such as autism spectrum disorder (ASD), where impaired social interactions and communication are core symptoms. 
The variability in behavioral phenotype expression is also contingent upon the role environmental factors play in defining genetic expression. Considering the prevailing clinical diagnosis of ASD, research on therapeutic targets for autism is essential. Behavioral impairments may be identified along a range of increasingly complex social tasks. Hence, the assessment of social behavior and communication is progressing towards more ethologically relevant tasks. Garnering a more accurate understanding of social processing deficits in the sensory domain may greatly contribute to the development of therapeutic targets. With that framework, studies have found a viable link between social behaviors, circuit wiring, and altered neuronal coding related to the processing of salient social stimuli. Here, the relationship between social odor processing in rodents and humans is examined in the context of health and ASD, with special consideration for how genetic expression and neuronal connectivity may regulate behavioral phenotypes.	Activation of the medial preoptic area (MPOA) ameliorates loss of maternal behavior in a Shank2 mouse model for autism. Impairments in social relationships and awareness are features observed in autism spectrum disorders (ASDs). However, the underlying mechanisms remain poorly understood. Shank2 is a high-confidence ASD candidate gene and localizes primarily to postsynaptic densities (PSDs) of excitatory synapses in the central nervous system (CNS). We show here that loss of Shank2 in mice leads to a lack of social attachment and bonding behavior towards pups independent of hormonal, cognitive, or sensitive deficits. Shank2<sup>-/-</sup> mice display functional changes in nuclei of the social attachment circuit that were most prominent in the medial preoptic area (MPOA) of the hypothalamus. 
Selective enhancement of MPOA activity by DREADD technology re-established social bonding behavior in Shank2<sup>-/-</sup> mice, providing evidence that the identified circuit might be crucial for explaining how social deficits in ASD can arise.	SHANK2 mutations impair apoptosis, proliferation and neurite outgrowth during early neuronal differentiation in SH-SY5Y cells. SHANK2 mutations have been identified in individuals with neurodevelopmental disorders, including intellectual disability and autism spectrum disorders (ASD). Using CRISPR/Cas9 genome editing, we obtained SH-SY5Y cell lines with frameshift mutations on one or both SHANK2 alleles. We investigated the effects of the different SHANK2 mutations on cell morphology, cell proliferation and differentiation potential during early neuronal differentiation. All mutant cell lines showed impaired neuronal differentiation marker expression. Cells with bi-allelic SHANK2 mutations revealed diminished apoptosis and increased proliferation, as well as decreased neurite outgrowth during early neuronal differentiation. Bi-allelic SHANK2 mutations resulted in an increase in p-AKT levels, suggesting that SHANK2 mutations impair downstream signaling of tyrosine kinase receptors. Additionally, cells with bi-allelic SHANK2 mutations had lower amyloid precursor protein (APP) expression compared to controls, suggesting a molecular link between SHANK2 and APP. Together, we can show that frameshift mutations on one or both SHANK2 alleles lead to an alteration of neuronal differentiation in SH-SY5Y cells, characterized by changes in cell growth and pre- and postsynaptic protein expression. We also provide first evidence that downstream signaling of tyrosine kinase receptors and amyloid precursor protein expression are affected.	The Temple Grandin Genome: Comprehensive Analysis in a Scientist with High-Functioning Autism. Autism spectrum disorder (ASD) is a heterogeneous condition with a complex genetic etiology. 
The objective of this study is to identify the complex genetic factors that underlie the ASD phenotype and other clinical features of Professor Temple Grandin, an animal scientist and woman with high-functioning ASD. Identifying the underlying genetic cause for ASD can impact medical management, personalize services and treatment, and uncover other medical risks that are associated with the genetic diagnosis. Prof. Grandin underwent chromosomal microarray analysis, whole exome sequencing, and whole genome sequencing, as well as a comprehensive clinical and family history intake. The raw data were analyzed in order to identify possible genotype-phenotype correlations. Genetic testing identified variants in three genes (SHANK2, ALX1, and RELN) that are candidate risk factors for ASD. We identified variants in MEFV and WNT10A, reported to be disease-associated in previous studies, which are likely to contribute to some of her additional clinical features. Moreover, candidate variants in genes encoding metabolic enzymes and transporters were identified, some of which suggest potential therapies. This case report describes the genomic findings in Prof. Grandin and it serves as an example to discuss state-of-the-art clinical diagnostics for individuals with ASD, as well as the medical, logistical, and economic hurdles that are involved in clinical genetic testing for an individual on the autism spectrum.
+POGZ	autism	A case of White-Sutton syndrome with previously described loss-of-function variant in DDE domain of POGZ (p.Arg1211*) and Kartagener syndrome. NA	Widespread labeling and genomic editing of the fetal central nervous system by in utero CRISPR AAV9-PHP.eB administration. Efficient genetic manipulation in the developing central nervous system is crucial for investigating mechanisms of neurodevelopmental disorders and the development of promising therapeutics. Common approaches including transgenic mice and in utero electroporation, although powerful in many aspects, have their own limitations. In this study, we delivered vectors based on the AAV9.PHP.eB pseudo-type to the fetal mouse brain, and achieved widespread and extensive transduction of neural cells. When AAV9.PHP.eB-coding gRNA targeting PogZ or Depdc5 was delivered to Cas9 transgenic mice, widespread gene knockout was also achieved at the whole brain level. Our studies provide a useful platform for studying brain development and devising genetic intervention for severe developmental diseases.	Neuropsychological study in 19 French patients with White-Sutton syndrome and POGZ mutations. White-Sutton syndrome is a rare developmental disorder characterized by global developmental delay, intellectual disabilities (ID), and neurobehavioral abnormalities secondary to pathogenic pogo transposable element-derived protein with zinc finger domain (POGZ) variants. The purpose of our study was to describe the neurocognitive phenotype of an unbiased national cohort of patients with identified POGZ pathogenic variants. This study is based on a French collaboration through the AnDDI-Rares network, and includes 19 patients from 18 families with POGZ pathogenic variants. All clinical data and neuropsychological tests were collected from medical files. Among the 19 patients, 14 patients exhibited ID (six mild, five moderate and three severe). 
The five remaining patients had learning disabilities and shared a similar neurocognitive profile, including language difficulties, dysexecutive syndrome, attention disorders, slowness, and social difficulties. One patient evaluated for autism was found to have moderate autism spectrum disorder. This study reveals that the cognitive phenotype of patients with POGZ pathogenic variants can range from learning disabilities to severe ID. It highlights that pathogenic variations in the same genes can be reported in a large spectrum of neurocognitive profiles, and that children with learning disabilities could benefit from next generation sequencing techniques.	Pogz deficiency leads to transcription dysregulation and impaired cerebellar activity underlying autism-like behavior in mice. Several genes implicated in autism spectrum disorder (ASD) are chromatin regulators, including POGZ. The cellular and molecular mechanisms leading to ASD impaired social and cognitive behavior are unclear. Animal models are crucial for studying the effects of mutations on brain function and behavior as well as unveiling the underlying mechanisms. Here, we generate a brain specific conditional knockout mouse model deficient for Pogz, an ASD risk gene. We demonstrate that Pogz deficient mice show microcephaly, growth impairment, increased sociability, learning and motor deficits, mimicking several of the human symptoms. At the molecular level, luciferase reporter assay indicates that POGZ is a negative regulator of transcription. In accordance, in Pogz deficient mice we find a significant upregulation of gene expression, most notably in the cerebellum. Gene set enrichment analysis revealed that the transcriptional changes encompass genes and pathways disrupted in ASD, including neurogenesis and synaptic processes, underlying the observed behavioral phenotype in mice. 
Physiologically, Pogz deficiency is associated with a reduction in the firing frequency of simple and complex spikes and an increase in amplitude of the inhibitory synaptic input in cerebellar Purkinje cells. Our findings support a mechanism linking heterochromatin dysregulation to cerebellar circuit dysfunction and behavioral abnormalities in ASD.	Altered hippocampal-prefrontal communication during anxiety-related avoidance in mice deficient for the autism-associated gene Pogz. Many genes have been linked to autism. However, it remains unclear what long-term changes in neural circuitry result from disruptions in these genes, and how these circuit changes might contribute to abnormal behaviors. To address these questions, we studied behavior and physiology in mice heterozygous for Pogz, a high confidence autism gene. Pogz<sup>+/-</sup> mice exhibit reduced anxiety-related avoidance in the elevated plus maze (EPM). Theta-frequency communication between the ventral hippocampus (vHPC) and medial prefrontal cortex (mPFC) is known to be necessary for normal avoidance in the EPM. We found deficient theta-frequency synchronization between the vHPC and mPFC in vivo. When we examined vHPC-mPFC communication at higher resolution, vHPC input onto prefrontal GABAergic interneurons was specifically disrupted, whereas input onto pyramidal neurons remained intact. These findings illustrate how the loss of a high confidence autism gene can impair long-range communication by causing inhibitory circuit dysfunction within pathways important for specific behaviors.
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/test_data	Wed Mar 24 08:33:56 2021 +0000
@@ -0,0 +1,7 @@
+ID_gene	GROUPING_disease
+SCN1A	epilepsy
+SCN9A	epilepsy
+GRIN2A	epilepsy
+ANKRD11	autism
+SHANK2	autism
+POGZ	autism
\ No newline at end of file
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/text_to_wordmatrix_output	Wed Mar 24 08:33:56 2021 +0000
@@ -0,0 +1,7 @@
+scn1a	patient	variant	scn2a	gene	polymorphism	genetic	pathogenic	aeds	epilepsy	rs2298771	developmental	resistance	result	rs10188577	rs17183814	rs2304016	significant	study	asian	association	channel	clinical	diagnosis	dravet	first	group	identified	metaanalysis	model	mutation	neurodevelopmental	n  3	sequencing	significance	sodium	syndrome	analysis	antiepileptic	bonferroni	caucasian	child	correction	correlation	decline	disorder	found	global	homozygous	however	gbm	neuron	cell	pain	compound	sensory	associated	inhibitor	mouse	nav17	aetiology	cohort	current	human	lebanese	nociceptive	pathway	sample	among	itch	new	analgesic	overall	respectively	scn9a	survival	293	activity	age	candidate	chronic	da0218	decreased	differentiate	dorsal	drg	drgsgcs	embryonic	cldn182	concussion	gastric	srcc	advanced	expression	rib	behavior	ketamine	man	elite	impairment	induced	memory	rugby	signaling	spatial	within	effect	infusion	level	grin2a	highdose	hippocampus	lowdose	nmda	reduced	system	treatment	180	antidepressant	anxietylike	bdnf	brain	change	depression	depressiveanxietylike	due	ankrd11	kbg	two	disability	feature	intellectual	novel	short	stature	heterozygous	including	outcome	report	adult	case	pleomorphic	review	urmss	uterine	alteration	condition	delay	facial	individual	number	primary	spectrum	three	tumor	ucss	variable	16q243	abnormalities	assessment	autism	shank2	protein	social	asd	neuronal	wnv	circuit	differentiation	ns5	can	pbm	replication	factor	grandin	interaction	may	pdzcontaining	processing	viral	wiring	behavioral	biallelic	cellular	communication	complex	deficit	domain	early	genome	lead	medical	mpoa	nile	phenotype	pogz	deficient	disabilities	learning	mechanism	avoidance	cerebellar	input	neurocognitive	severe	underlying	vhpc	whitesutton	widespread	aav9phpeb	achieved	anxietyrelated	based	central	cognitive	confidence	crucial	deficiency	delivered	development	difficulties	disrupted
+1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
+0	1	1	0	1	0	1	0	0	0	0	0	0	0	0	0	0	0	1	0	0	1	0	0	0	0	0	1	0	1	1	0	0	0	0	1	0	1	0	0	0	1	0	0	0	0	0	0	0	0	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
+0	1	0	0	1	0	1	0	0	0	0	0	0	0	0	0	0	0	1	0	1	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	1	0	0	0	1	0	1	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
+0	1	1	0	1	0	1	1	0	0	0	1	0	0	0	0	0	0	1	0	0	0	1	0	0	0	0	1	0	0	1	1	0	1	0	0	1	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
+0	0	0	0	1	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	1	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	1	0	0	0	0	0	1	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	1	0	0	0	0	0	0	0	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
+0	1	1	0	1	0	0	1	0	0	0	1	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	1	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	1	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	1	0	1	0	0	0	1	0	0	1	0	0	1	0	0	0	0	0	0	0	0	0	0	1	0	0	1	0	0	0	0	0	0	0	0	0	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/text_to_wordmatrix_output_args	Wed Mar 24 08:33:56 2021 +0000
@@ -0,0 +1,7 @@
+the	and	scna	patients	were	with	variants	n  	for	pathogenic	polymorphisms	genetic	genes	epilepsy	aeds	gene	results	this	resistance	developmental	significant	that	study	was	between	polymorphism	clinical	dravet	sequencing	syndrome	are	neurodevelopmental	sodium	diagnosis	identified	significance	asians	association	metaanalysis	first	group	analysis	children	found	their	two	variant	channel	disorders	from	neurons	gbm	our	pain	mutations	cells	sensory	nav	associated	inhibitor	nociceptive	human	aetiology	lebanese	new	pathway	itch	model	mouse	among	cohort	samples	currents	into	analgesic	compound	compounds	respectively	overall	survival	cldn	gastric	srcc	rib	advanced	expression	behavior	man	concussion	ketamine	impairment	induced	memory	signaling	spatial	elite	rugby	within	levels	mice	infusion	cell	highdose	hippocampus	grina	reduced	treatment	nmda	lowdose	anxietylike	brain	ankrd	kbg	urms	novel	disability	intellectual	short	stature	including	features	heterozygous	adult	pleomorphic	these	urmss	uterine	review	had	outcomes	alterations	mutation	number	primary	three	tumors	ucss	shank	social	asd	autism	proteins	wnv	neuronal	protein	circuit	differentiation	can	pbm	replication	spectrum	both	during	interactions	may	pdzcontaining	viral	candidate	factors	processing	wiring	grandin	cellular	domain	nile	other	show	targets	virus	west	pogz	deficient	disabilities	learning	communication	whitesutton	have	mechanisms	severe	widespread	disorder	neurocognitive	phenotype	cerebellar	changes	input	underlying	avoidance	vhpc	aavphpeb	achieved
+1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
+1	1	1	1	1	1	1	0	1	0	0	1	1	0	0	0	0	1	0	0	0	1	1	1	0	0	0	0	0	0	1	0	1	0	1	0	0	0	0	0	0	0	1	0	0	0	0	1	0	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
+1	1	0	1	1	1	0	0	1	0	0	1	1	0	0	0	0	1	0	0	0	1	1	1	1	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	1	0	0	0	0	0	0	0	0	1	0	0	1	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
+1	1	0	1	1	1	1	0	0	1	0	1	1	0	0	1	0	0	0	1	0	1	0	1	0	0	1	0	1	1	1	1	0	0	1	0	0	0	0	0	0	0	0	0	0	1	0	0	0	1	0	0	1	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
+1	1	0	0	0	1	0	0	1	0	0	1	0	0	0	0	0	1	0	0	0	1	0	0	1	0	1	0	0	0	1	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	1	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
+1	1	0	1	0	1	1	0	1	1	0	0	1	0	0	1	0	1	0	1	0	1	1	1	0	0	0	0	0	1	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	1	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	1	0	0	0	0	1	0	1	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/test/commands_tests	Wed Mar 24 08:33:56 2021 +0000
@@ -0,0 +1,37 @@
+#commands to test the tools with "test_data" 
+
+ $ cd <path>/simtext
+
+ $ Rscript pubmed_by_queries.R --input "test-data/test_data" --output "test-data/pubmed_by_queries_output" --install_packages
+ #output: test-data/pubmed_by_queries_output
+
+ $ Rscript pubmed_by_queries.R --input "test-data/test_data" --abstract --output "test-data/pubmed_by_queries_output_abstracts" --install_packages
+ #output: test-data/pubmed_by_queries_output_abstracts
+
+ $ Rscript abstracts_by_pmids.R --input "test-data/pubmed_by_queries_output" --output "test-data/abstracts_by_pmids_output" --install_packages
+ #output: test-data/abstracts_by_pmids_output
+
+ $ Rscript text_to_wordmatrix.R --input "test-data/pubmed_by_queries_output_abstracts" --output "test-data/text_to_wordmatrix_output" --install_packages
+ #output: test-data/text_to_wordmatrix_output
+
+ $ Rscript text_to_wordmatrix.R --input "test-data/pubmed_by_queries_output_abstracts" --output "test-data/text_to_wordmatrix_output_args" --remove_num --remove_stopwords --plurals --install_packages
+ #output: test-data/text_to_wordmatrix_output_args
+ 
+  $ Rscript pmids_to_pubtator_matrix.R --input "test-data/pubmed_by_queries_output" --output "test-data/pmids_to_pubtator_matrix_output" --number 50 --categories Gene Mutation --install_packages
+ #output: test-data/pmids_to_pubtator_matrix_output
+
+  $ Rscript pmids_to_pubtator_matrix.R --input "test-data/pubmed_by_queries_output" --output "test-data/pmids_to_pubtator_matrix_output_byid" --number 50 --categories Gene Disease --install_packages --byid
+ #output: test-data/pmids_to_pubtator_matrix_output_byid
+
+  $ Rscript pmids_to_pubtator_matrix.R --input "test-data/pubmed_by_queries_output" --output "test-data/pmids_to_pubtator_matrix_output_number" --number 5 --categories Gene Disease --install_packages
+ #output: test-data/pmids_to_pubtator_matrix_output_number
+
+ $ Rscript simtext_app.R -i "test-data/test_data" -m "test-data/text_to_wordmatrix_output" --install_packages
+ #output: ShinyApp
+
+ $ Rscript simtext_app.R -i "test-data/test_data" -m "test-data/pmids_to_pubtator_matrix_output" --install_packages
+ #output: ShinyApp
+
+
+
+
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/text_to_wordmatrix.R	Wed Mar 24 08:33:56 2021 +0000
@@ -0,0 +1,106 @@
+#!/usr/bin/env Rscript
+# tool: text_to_wordmatrix
+#
+#The tool extracts the most frequent words per entity (per row). Text in columns starting with "ABSTRACT" or "TEXT" is considered.
+#All extracted terms are used to generate a word matrix with rows = entities and columns = extracted words.
+#The resulting matrix is binary with 0= word not present in abstracts of entity and 1= word present in abstracts of entity.
+#
+#Input: Output of "pubmed_by_queries" or "abstracts_by_pmids", or tab-delimited table with entities in column called “ID_<name>”,
+#e.g. “ID_genes” and text in columns starting with "ABSTRACT" or "TEXT".
+#
+#Output: Binary matrix with rows = entities and columns = extracted words.
+#
+#usage: text_to_wordmatrix.R [-h] [-i INPUT] [-o OUTPUT] [-n NUMBER] [-r] [-l] [-w] [-s] [-p]
+#
+# optional arguments:
+# -h, --help                    show help message
+# -i INPUT, --input INPUT       input file name. add path if file is not in working directory
+# -o OUTPUT, --output OUTPUT    output file name. [default "text_to_wordmatrix_output"]
+# -n NUMBER, --number NUMBER    number of most frequent words that should be extracted [default "50"]
+# -r, --remove_num              remove any numbers in text
+# -l, --lower_case              by default all characters are translated to lower case. otherwise use -l
+# -w, --remove_stopwords        by default a set of english stopwords (e.g., "the" or "not") are removed. otherwise use -w
+# -s, --stemDoc                 apply Porter's stemming algorithm: collapsing words to a common root to aid comparison of vocabulary
+# -p, --plurals                 by default words in plural and singular are merged to the singular form. otherwise use -p
+
+if ("--install_packages" %in% commandArgs()) {
+  print("Installing packages")
+  if (!require("argparse")) install.packages("argparse", repo = "http://cran.rstudio.com/");
+  if (!require("PubMedWordcloud")) install.packages("PubMedWordcloud", repo = "http://cran.rstudio.com/");
+  if (!require("SnowballC")) install.packages("SnowballC", repo = "http://cran.rstudio.com/");
+  if (!require("textclean")) install.packages("textclean", repo = "http://cran.rstudio.com/");
+  if (!require("SemNetCleaner")) install.packages("SemNetCleaner", repo = "http://cran.rstudio.com/");
+  if (!require("stringi")) install.packages("stringi", repo = "http://cran.rstudio.com/");
+  if (!require("stringr")) install.packages("stringr", repo = "http://cran.rstudio.com/");
+}
+
+suppressPackageStartupMessages(library("argparse"))
+suppressPackageStartupMessages(library("PubMedWordcloud"))
+suppressPackageStartupMessages(library("SnowballC"))
+suppressPackageStartupMessages(library("SemNetCleaner"))
+suppressPackageStartupMessages(library("textclean"))
+suppressPackageStartupMessages(library("stringi"))
+suppressPackageStartupMessages(library("stringr"))
+
+parser <- ArgumentParser()
+parser$add_argument("-i", "--input",
+                    help = "input file name. add path if file is not in working directory")
+parser$add_argument("-o", "--output", default = "text_to_wordmatrix_output",
+                    help = "output file name. [default \"%(default)s\"]")
+parser$add_argument("-n", "--number", type = "integer", default = 50, choices = seq(1, 500), metavar = "{1..500}",
+                    help = "number of most frequent words used per ID in word matrix [default \"%(default)s\"]")
+parser$add_argument("-r", "--remove_num", action = "store_true", default = FALSE,
+                    help = "remove any numbers in text")
+parser$add_argument("-l", "--lower_case", action = "store_false", default = TRUE,
+                    help = "by default all characters are translated to lower case. otherwise use -l")
+parser$add_argument("-w", "--remove_stopwords", action = "store_false", default = TRUE,
+                    help = "by default a set of English stopwords (e.g., 'the' or 'not') are removed. otherwise use -w")
+parser$add_argument("-s", "--stemDoc", action = "store_true", default = FALSE,
+                    help = "apply Porter's stemming algorithm: collapsing words to a common root to aid comparison of vocabulary")
+parser$add_argument("-p", "--plurals", action = "store_false", default = TRUE,
+                    help = "by default words in plural and singular are merged to the singular form. otherwise use -p")
+parser$add_argument("--install_packages", action = "store_true", default = FALSE,
+                    help = "If you want to auto install missing required packages.")
+
+args <- parser$parse_args()
+
+
+data <- read.delim(args$input, stringsAsFactors = FALSE, header = TRUE, sep = "\t")
+word_matrix <- data.frame()
+
+text_cols_index <- grep(c("ABSTRACT|TEXT"), names(data))
+
+for (row in seq(nrow(data))) {
+    top_words <- cleanAbstracts(abstracts = data[row, text_cols_index],
+                               rmNum = args$remove_num,
+                               tolw = args$lower_case,
+                               rmWords = args$remove_stopwords,
+                               stemDoc = args$stemDoc)
+
+    top_words$word <- as.character(top_words$word)
+
+    cat("Most frequent words for row", row, " are extracted.", "\n")
+
+    if (args$plurals == TRUE) {
+      top_words$word <- sapply(top_words$word, function(x) {
+        singularize(x)
+        })
+      top_words <- aggregate(freq~word, top_words, sum)
+    }
+
+    top_words <- top_words[order(top_words$freq, decreasing = TRUE), ]
+    top_words$word <- as.character(top_words$word)
+
+    number_extract <- min(args$number, nrow(top_words))
+    word_matrix[row, sapply(1:number_extract, function(x) {
+      paste0(top_words$word[x])
+      })] <- top_words$freq[1:number_extract]
+}
+
+word_matrix <- as.matrix(word_matrix)
+word_matrix[is.na(word_matrix)] <- 0
+word_matrix <- (word_matrix > 0) * 1  #binary matrix
+
+cat("A matrix with ", nrow(word_matrix), " rows and ", ncol(word_matrix), "columns is generated.", "\n")
+
+write.table(word_matrix, args$output, row.names = FALSE, sep = "\t", quote = FALSE)
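
The matrix-building step at the end of text_to_wordmatrix.R can be illustrated in isolation. The following is a minimal base-R sketch under stated assumptions, not the tool's own code: the hard-coded `top_words_per_row` list is hypothetical and stands in for the per-row word/frequency tables that `cleanAbstracts()` returns, `number` stands in for `--number`, and the helper `top_n_words` is introduced only for illustration. As in the script, a cell is set to 1 only when the word is among the top `number` words of that row.

```r
# Hypothetical per-entity word counts (assumption: stand-in for cleanAbstracts() output).
top_words_per_row <- list(
  data.frame(word = c("sodium", "channel", "epilepsy"),
             freq = c(5, 3, 2), stringsAsFactors = FALSE),
  data.frame(word = c("autism", "channel"),
             freq = c(4, 1), stringsAsFactors = FALSE)
)
number <- 2  # assumption: stand-in for the --number argument

# Illustrative helper: return the n most frequent words of one table.
top_n_words <- function(tw, n) {
  tw <- tw[order(tw$freq, decreasing = TRUE), ]
  head(tw$word, n)
}

# Top words per row, then the union of all of them as matrix columns.
selected <- lapply(top_words_per_row, top_n_words, n = number)
vocab <- unique(unlist(selected))

# Start from an all-zero matrix and mark the top words of each row with 1.
word_matrix <- matrix(0, nrow = length(selected), ncol = length(vocab),
                      dimnames = list(NULL, vocab))
for (i in seq_along(selected)) {
  word_matrix[i, selected[[i]]] <- 1
}

print(word_matrix)
```

Preallocating a zero matrix over the combined vocabulary avoids the NA-filling step that the script needs after growing a data frame row by row; the resulting binary matrix has the same meaning.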