comparison README.md @ 0:02e46a96e98a draft default tip
"planemo upload for repository https://github.com/galaxyproject/tools-iuc/tools/simtext commit 63a5e13cf89cdd209d20749c582ec5b8dde4e208"
author | iuc |
---|---|
date | Wed, 24 Mar 2021 08:34:22 +0000 |
parents | |
children |
-1:000000000000 | 0:02e46a96e98a |
---|---|
1 # SimText | |
2 | |
3 A text mining framework for interactive analysis and visualization of similarities among biomedical entities. | |
4 | |
5 ## Brief overview of tools: | |
6 | |
7 - pubmed_by_queries: | |
8 | |
9 For each search query, PMIDs or abstracts from PubMed are saved. | |
10 | |
11 - abstracts_by_pmids: | |
12 | |
13 For all PMIDs in each row of a table, the corresponding abstracts are saved in additional columns. | |
14 | |
15 - text_to_wordmatrix: | |
16 | |
17 The most frequent words of text from each row are extracted and united in one large binary matrix. | |
18 | |
19 - pmids_to_pubtator_matrix: | |
20 | |
21 For PMIDs of each row, scientific words are extracted using PubTator annotations and subsequently united in one large binary matrix. | |
22 | |
23 - simtext_app: | |
24 | |
25 Shiny app with word clouds, a dimension reduction plot, a dendrogram of hierarchical clustering, and a table of words and their frequency among the search queries. | |
26 | |
27 ## Set up user credentials on Galaxy | |
28 | |
29 To enable users to set their credentials (NCBI API Key) for this tool, | |
30 make sure the file `config/user_preferences_extra_conf.yml` has the following section: | |
31 | |
32 ``` | |
33 preferences: | |
34 ncbi_account: | |
35 description: NCBI account information | |
36 inputs: | |
37 - name: apikey | |
38 label: NCBI API Key (available from "API Key Management" at https://www.ncbi.nlm.nih.gov/account/settings/) | |
39 type: text | |
40 required: False | |
41 | |
42 ``` | |
43 | |
44 ## Requirements command-line version | |
45 | |
46 - R (version > 4.0.0) | |
47 | |
48 ## Installation command-line version | |
49 | |
50 ``` | |
51 $ mkdir -p <path>/simtext | |
52 $ cd <path>/simtext | |
53 $ git clone https://github.com/dlal-group/simtext | |
54 ``` | |
55 | |
56 ## pubmed_by_queries | |
57 | |
58 This tool uses a set of search queries to download a defined number of abstracts or PMIDs per query from PubMed. PubMed's search rules and syntax apply. Users can obtain an API key from the Settings page of their NCBI account (to create an account, visit http://www.ncbi.nlm.nih.gov/account/). When the tool is run from the command line, the API key is passed as an argument. When it is run in Galaxy, the API key is added to the Galaxy user preferences (User > Preferences > Manage Information). | |
59 | |
60 Input: | |
61 | |
62 Tab-delimited table with a list of search queries (biomedical entities of interest) in one column. The column header should start with "ID_" (e.g., "ID_gene" if search queries are genes). | |
63 | |
64 Usage: | |
65 ``` | |
66 $ Rscript pubmed_by_queries.R [-h] [-i INPUT] [-o OUTPUT] [-n NUMBER] [-a] [-k KEY] [--install_packages] | |
67 ``` | |
68 | |
69 Optional arguments: | |
70 ``` | |
71 -h, --help show help message | |
72 -i INPUT, --input INPUT input file name. add path if file is not in working directory | |
73 -o OUTPUT, --output OUTPUT output file name [default "pubmed_by_queries_output"] | |
74 -n NUMBER, --number NUMBER number of PMIDs or abstracts to save per ID [default "5"] | |
75 -a, --abstract if abstracts instead of PMIDs should be retrieved, use --abstract | |
76 -k KEY, --key KEY if NCBI API key is available, add it to speed up the download of PubMed data. For usage in Galaxy add the API key to the Galaxy user-preferences (User/ Preferences/ Manage Information). | |
77 --install_packages if you want to auto install missing required packages | |
78 ``` | |
79 | |
80 Output: | |
81 | |
82 A table with additional columns containing PMIDs or abstracts from PubMed. | |
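
As a minimal sketch of a command-line run, the following builds a small input table and calls the script with the options documented above. The file names `queries.tsv` and `queries_abstracts`, the gene symbols, and the environment variable `NCBI_API_KEY` are assumptions made for illustration; the script is only called if it is found in the current directory.

```shell
# Build a minimal tab-delimited input table; the header must start with "ID_".
# The gene symbols are illustrative search queries.
printf 'ID_gene\nBRCA1\nTP53\nCFTR\n' > queries.tsv

# Run the tool only if Rscript and the script are available here.
if command -v Rscript >/dev/null 2>&1 && [ -f pubmed_by_queries.R ]; then
    # -n 10: save up to 10 results per query; -a: retrieve abstracts instead of PMIDs.
    # NCBI_API_KEY is a hypothetical variable name; -k is added only if it is set.
    Rscript pubmed_by_queries.R -i queries.tsv -o queries_abstracts -n 10 -a \
        ${NCBI_API_KEY:+-k "$NCBI_API_KEY"}
fi
```

Omitting `-a` would save PMIDs instead of abstracts, which is the input expected by the tools that work on PMID columns.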
83 | |
84 ## abstracts_by_pmids | |
85 | |
86 This tool retrieves abstracts for a matrix of PMIDs. The abstract text is saved in additional columns. | |
87 | |
88 Input: | |
89 | |
90 Tab-delimited table with rows representing biomedical entities and columns containing the corresponding PMIDs. The names of the PMID columns should start with “PMID_” (e.g., “PMID_1”, “PMID_2” etc.). | |
91 | |
92 Usage: | |
93 ``` | |
94 $ Rscript abstracts_by_pmids.R [-h] [-i INPUT] [-o OUTPUT] [--install_packages] | |
95 ``` | |
96 | |
97 Optional arguments: | |
98 ``` | |
99 -h, --help show help message | |
100 -i INPUT, --input INPUT input file name. add path if file is not in working directory | |
101 -o OUTPUT, --output OUTPUT output file name [default "abstracts_by_pmids_output"] | |
102 --install_packages if you want to auto install missing required packages | |
103 ``` | |
104 | |
105 Output: | |
106 | |
107 A table with additional columns containing abstract texts. | |
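
The input layout described above can be sketched as follows. The file name `pmids.tsv`, the `ID_gene` column, and all PMID values are placeholders chosen for illustration, not real references. The `awk` step is an optional sanity check that counts the header fields starting with "PMID_" before the table is passed to the tool with `-i`.

```shell
# Placeholder entities and PMIDs for illustration only; replace them with real values.
printf 'ID_gene\tPMID_1\tPMID_2\n'    >  pmids.tsv
printf 'GENE_A\t10000001\t10000002\n' >> pmids.tsv
printf 'GENE_B\t10000003\t10000004\n' >> pmids.tsv

# Count the header fields whose names start with "PMID_".
n_pmid_cols=$(awk -F'\t' 'NR == 1 { n = 0; for (i = 1; i <= NF; i++) if ($i ~ /^PMID_/) n++; print n }' pmids.tsv)
echo "PMID columns found: $n_pmid_cols"
```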
108 | |
109 ## text_to_wordmatrix | |
110 | |
111 For each row, the tool extracts the most frequent words from the text in columns starting with "ABSTRACT" or "TEXT". The extracted words from all rows are combined into one large binary matrix, where 0 = word not frequently occurring in the text of that row and 1 = word frequently occurring in the text of that row. | |
112 | |
113 Input: | |
114 | |
115 The output of ‘pubmed_by_queries’ or ‘abstracts_by_pmids’ tools, or a tab-delimited table with text in columns starting with "ABSTRACT" or "TEXT". | |
116 | |
117 Usage: | |
118 ``` | |
119 $ Rscript text_to_wordmatrix.R [-h] [-i INPUT] [-o OUTPUT] [-n NUMBER] [-r] [-l] [-w] [-s] [-p] | |
120 ``` | |
121 | |
122 Optional arguments: | |
123 ``` | |
124 -h, --help show help message | |
125 -i INPUT, --input INPUT input file name. add path if file is not in working directory | |
126 -o OUTPUT, --output OUTPUT output file name. [default "text_to_wordmatrix_output"] | |
127 -n NUMBER, --number NUMBER number of most frequent words that should be extracted per row [default "50"] | |
128 -r, --remove_num remove any numbers in text | |
129 -l, --lower_case by default all characters are translated to lower case. otherwise use -l | |
130 -w, --remove_stopwords by default a set of english stopwords (e.g., 'the' or 'not') are removed. otherwise use -w | |
131 -s, --stemDoc apply Porter's stemming algorithm: collapsing words to a common root to aid comparison of vocabulary | |
132 -p, --plurals by default words in plural and singular are merged to the singular form. otherwise use -p | |
133 --install_packages if you want to auto install missing required packages | |
134 ``` | |
135 | |
136 Output: | |
137 | |
138 A binary matrix in which each column represents one of the extracted words. | |
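
A hedged example of a run on a hand-made table, using only the options listed above. The file names `texts.tsv` and `wordmatrix` and the example sentences are assumptions for illustration; the script is only called if it is present in the current directory.

```shell
# Minimal table with one text column whose name starts with "TEXT".
# The sentences are made-up examples, not PubMed content.
printf 'ID_gene\tTEXT_1\n'                                         >  texts.tsv
printf 'GENE_A\tmutations in this gene are linked to epilepsy\n'   >> texts.tsv
printf 'GENE_B\tthis gene encodes a sodium channel subunit\n'      >> texts.tsv

if command -v Rscript >/dev/null 2>&1 && [ -f text_to_wordmatrix.R ]; then
    # -n 20: extract the 20 most frequent words per row.
    # -r: remove numbers; -s: apply Porter stemming.
    Rscript text_to_wordmatrix.R -i texts.tsv -o wordmatrix -n 20 -r -s
fi
```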
139 | |
140 ## pmids_to_pubtator_matrix | |
141 | |
142 The tool uses all PMIDs per row and extracts "Gene", "Disease", "Mutation", "Chemical" and "Species" terms from the corresponding abstracts, using PubTator annotations. The user can choose the categories from which terms are extracted. The extracted terms are combined into one large binary matrix, where 0 = term not present in the abstracts of that row and 1 = term present in the abstracts of that row. The user can decide whether the scientific terms are extracted and used as they are or grouped by their gene IDs / MeSH IDs (several terms are often grouped into one ID). By default all terms are extracted; alternatively, the user can specify the number of most frequent terms to extract per row. | |
143 | |
144 Input: | |
145 | |
146 Output of 'abstracts_by_pmids' tool, or tab-delimited table with columns containing PMIDs. The names of the PMID columns should start with "PMID", e.g. "PMID_1", "PMID_2" etc. | |
147 | |
148 Usage: | |
149 ``` | |
150 $ Rscript pmids_to_pubtator_matrix.R [-h] [-i INPUT] [-o OUTPUT] [-b] [-n NUMBER] [-c {Gene,Disease,Mutation,Chemical,Species} [{Gene,Disease,Mutation,Chemical,Species} ...]] | |
151 ``` | |
152 | |
153 Optional arguments: | |
154 ``` | |
155 -h, --help show help message | |
156 -i INPUT, --input INPUT input file name. add path if file is not in working directory | |
157 -o OUTPUT, --output OUTPUT output file name. [default "pmids_to_pubtator_matrix_output"] | |
158 -b, --byid if you want to find common gene IDs / mesh IDs instead of specific scientific terms. | |
159 -n NUMBER, --number NUMBER number of most frequent terms/IDs to extract. by default all terms/IDs are extracted. | |
160 -c [...], --categories [...] PubTator categories that should be considered [default "('Gene', 'Disease', 'Mutation','Chemical')"] | |
161 --install_packages if you want to auto install missing required packages | |
162 ``` | |
163 | |
164 Output: | |
165 | |
166 A binary matrix in which each column represents one of the extracted terms. | |
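
The following sketch restricts the extraction to two categories and to the most frequent terms per row. The file names `pubtator_input.tsv` and `pubtator_matrix` and all PMID values are placeholders for illustration; the script is only called if it is present in the current directory.

```shell
# Placeholder table with PMID columns; the values are not real references.
printf 'ID_gene\tPMID_1\tPMID_2\n'    >  pubtator_input.tsv
printf 'GENE_A\t10000001\t10000002\n' >> pubtator_input.tsv

if command -v Rscript >/dev/null 2>&1 && [ -f pmids_to_pubtator_matrix.R ]; then
    # -n 20: keep the 20 most frequent terms per row.
    # -c Gene Disease: consider only these two PubTator categories.
    Rscript pmids_to_pubtator_matrix.R -i pubtator_input.tsv -o pubtator_matrix \
        -n 20 -c Gene Disease
fi
```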
167 | |
168 ## simtext_app | |
169 | |
170 The tool enables the exploration of data generated by the ‘text_to_wordmatrix’ or ‘pmids_to_pubtator_matrix’ tools in a local Shiny instance. The following features can be generated: 1) word clouds for each initial search query, 2) dimension reduction and hierarchical clustering of binary matrices, and 3) tables with words and their frequency in the search queries. | |
171 | |
172 Input: | |
173 | |
174 1) Input 1: | |
175 Tab-delimited table with | |
176 - A column with initial search queries starting with "ID_" (e.g., "ID_gene" if initial search queries were genes). | |
177 - Column(s) with grouping factor(s) to compare pre-existing categories of the initial search queries with the grouping based on text. The column names should start with "GROUPING_". If the column name is "GROUPING_disorder", "disorder" will be shown as a grouping variable in the app. | |
178 2) Input 2: | |
179 The output of ‘text_to_wordmatrix’ or ‘pmids_to_pubtator_matrix’ tools, or a binary matrix. | |
180 | |
181 Usage: | |
182 ``` | |
183 $ Rscript simtext_app.R [-h] [-i INPUT] [-m MATRIX] [-p PORT] | |
184 ``` | |
185 | |
186 Optional arguments: | |
187 ``` | |
188 -h, --help show help message | |
189 -i INPUT, --input INPUT input file name. add path if file is not in working directory | |
190 -m MATRIX, --matrix MATRIX matrix file name. add path if file is not in working directory | |
191 -p PORT, --port PORT specify port, otherwise randomly selected | |
192 --host specify host | |
193 --install_packages if you want to auto install missing required packages | |
194 ``` | |
195 | |
196 Output: | |
197 | |
198 SimText app |
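
A minimal sketch of preparing Input 1 and starting the app on a fixed port. The file names `groups.tsv` and `wordmatrix`, the group labels, and port 8888 are assumptions for illustration. The app is only started if the script and a matrix file from a previous step are present in the current directory.

```shell
# Input 1: search queries plus one grouping factor ("disorder" is shown in the app).
# The entities and group labels are illustrative.
printf 'ID_gene\tGROUPING_disorder\n' >  groups.tsv
printf 'GENE_A\tepilepsy\n'           >> groups.tsv
printf 'GENE_B\tautism\n'             >> groups.tsv

# Input 2 is assumed to be a binary matrix saved as "wordmatrix" by an earlier step.
if command -v Rscript >/dev/null 2>&1 && [ -f simtext_app.R ] && [ -f wordmatrix ]; then
    # -p 8888: serve the Shiny app on a fixed port instead of a random one.
    Rscript simtext_app.R -i groups.tsv -m wordmatrix -p 8888
fi
```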