gentext - generate text and stats files in NOAH database
gentext [OPTIONS]
gentext is an administration program that is called periodically to generate the necessary database files to support content searching of files in the NOAH database.
There may be a wrapper script in /usr/bin with a noah- prefix that calls this program from the correct run directory.
3rd Party Programs
NOAH relies on a number of 3rd party programs to properly
index the NOAH database for content searching.
If you are not sure whether these 3rd party programs are installed, go to http://www.nordicwind.ca/noah/3rdparty for guidance.
MS Windows
In a new installation of NOAH on a Windows server, the
email.bat file will contain a line that calls
gentext with the -new option. This means that every
time the email batch program is run, it will also index any
recent uploads as well as update the content.sum
summary files to make content searches efficient.
If you don’t have the NOAH Email Notification option, make sure you still schedule email.bat to run every five minutes.
Since most searches are for documents that are more than five minutes old, a short delay on a Windows server before new documents are indexed should not be a concern.
Linux:
On a Linux server, new uploads are indexed automatically at
upload time. However, you still need to schedule the gentext
program with the -update option so that the content.sum files
are kept up to date for efficient content searching.
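As a rough sketch of the scheduling described above, the following bash fragment builds the command a cron job could run. The admin directory /usr/lib/noah/admin is the typical Debian location named elsewhere in this document; the function name update_command and the five-minute interval in the comment are assumptions made for illustration, not NOAH requirements.

```shell
#!/bin/bash
# Sketch only: prints the command a cron job could run; it does not run gentext.
# ASSUMPTION: the admin directory is /usr/lib/noah/admin (Debian layout).
NOAH_ADMIN_DIR="/usr/lib/noah/admin"

# Build the command line for an incremental content.sum update.
# An optional first argument selects a database via -database.
update_command() {
    local database="${1:-}"
    local cmd="./gentext -update"
    if [ -n "$database" ]; then
        cmd="$cmd -database $database"
    fi
    printf 'cd %s && %s\n' "$NOAH_ADMIN_DIR" "$cmd"
}

# Hypothetical crontab line built from the output (every five minutes):
#   */5 * * * * cd /usr/lib/noah/admin && ./gentext -update
update_command "$@"
```

Changing into the admin directory first reflects the fact that gentext expects to be run from there.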
Indexing an Existing Database
If you already have a NOAH database and you are adding the
Content Search option, you will want to index the existing
files with the -missing option.
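A one-time indexing pass might be scripted as in the sketch below. It is a dry run by default: it only prints the gentext command, and executes it when DRY_RUN=0 is set. The helper names run and index_existing, the DRY_RUN variable, and the admin path are assumptions for illustration; -missing, -v and -database are the options documented on this page.

```shell
#!/bin/bash
# Sketch: index files that have no text/stats yet (dry run unless DRY_RUN=0).
# ASSUMPTION: the admin directory is /usr/lib/noah/admin (Debian layout).
ADMIN_DIR="/usr/lib/noah/admin"

# Print the command; execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

# Index an existing database; an optional argument names the database.
index_existing() {
    local database="${1:-}"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        cd "$ADMIN_DIR" || return 1   # gentext expects to run from here
    fi
    if [ -n "$database" ]; then
        run ./gentext -missing -v -database "$database"
    else
        run ./gentext -missing -v
    fi
}

index_existing "$@"
```

Keeping the default as a dry run makes it easy to check the command line before starting a pass over a large database.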
Content Search Limitations
The Content Search option uses text converters to generate
.stats files for each file in the database. If a file is not
in one of the supported formats, it is not indexed and no
error is reported.
Supported files for Content Searching:
• Windows .doc and .xls (Word and Excel) files.
• OpenOffice files (.sxw, .sxc).
• Adobe .pdf (Portable Document Format) files.
• Text files (see the text_params parameter in noah.config).
• PostScript .ps files (words are not always extracted correctly, especially the first letter of capitalized words).
Unsupported Formats
If a file is an unsupported format but it does have a
filename extension, gentext will call the
mkstats script with the extension as the 3rd
parameter. This script does nothing unless a user customizes
it to generate the .stats files for that file
extension. Of course you will need a text extraction program
for this to work!
See FILES section for the mkstats scripts.
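The customization described above could start from a sketch like the one below, which only shows dispatching on the extension. Only the third parameter (the extension) is documented here, so the sketch leaves the first two parameters untouched. The function name pick_converter is hypothetical, and unrtf and html2text are merely examples of third-party extraction tools that would have to be installed separately; they are not part of NOAH. Turning the extracted text into a .stats file is not shown.

```shell
#!/bin/bash
# Sketch of a customized mkstats script, NOT the stock NOAH script.
# Only $3 (the filename extension) is documented; $1 and $2 are not
# described on this page, so this sketch does not use them.

# Choose a text-extraction command for an extension.
# ASSUMPTION: unrtf and html2text are installed; they are examples only.
pick_converter() {
    case "$1" in
        rtf)       echo "unrtf --text" ;;
        htm|html)  echo "html2text" ;;
        *)         return 1 ;;   # unknown extension: do nothing
    esac
}

ext="${3:-}"
if converter="$(pick_converter "$ext")"; then
    echo "would extract text from .$ext files with: $converter"
fi
```

For an extension it does not recognize, the sketch stays silent, which matches the stock script's behavior of doing nothing.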
If you want to suppress warnings about file formats that you will never want indexed, add the file’s extension to the ignore-ext parameter in the noah.config file.
gentext [-force] [-new] [-missing] [-file filename] [-rebuild] [-update] [-v] [-h] [-help] [-database DATABASE]
-force
re-generate text/stats files for all files in database.
-new
generate text/stats files for new recently uploaded
files.
-missing
only generate text/stats on files that are missing
text/stats.
-file filename
index only this file (full path name).
-rebuild
rebuild content search content.sum files from scratch
in database (these are consolidations of stats files to
speed up searching).
-update
incrementally update content search content.sum files
that are out of date based on flag files in order to
catch up with recent file uploads.
-database DATABASE
Database name to operate on. Defaults to default; use this
option to select a different database.
-v
Verbose mode: print processing information.
-h
short usage help
-help
This detailed help document.
This program is found in the admin directory and expects to be run from there.
In a Debian distribution, a link in /usr/bin with a noah- prefix points to the wrapper program in the admin directory, typically /usr/lib/noah/admin.
FILES
/usr/bin/gentext
This is a link to the wrapper.
/usr/lib/noah/admin/wrapper
This is a generic bash shell wrapper to start the program in the admin directory.
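To illustrate the idea, a generic wrapper of this kind could look roughly like the sketch below. This is not the actual NOAH wrapper; the function names program_name and start_program are hypothetical, and the sketch assumes the link name is either the program name itself or the program name with a noah- prefix.

```shell
#!/bin/bash
# Sketch of a generic wrapper, NOT the actual NOAH wrapper script.
# ASSUMPTION: the admin directory is /usr/lib/noah/admin (Debian layout).
ADMIN_DIR="/usr/lib/noah/admin"

# Map the name the wrapper was invoked as (e.g. /usr/bin/noah-gentext)
# to the program name inside the admin directory (gentext).
program_name() {
    local name
    name="$(basename "$1")"
    printf '%s\n' "${name#noah-}"
}

# Change into the admin directory and replace this shell with the program.
start_program() {
    local prog
    prog="$(program_name "$1")"
    shift
    cd "$ADMIN_DIR" || { echo "cannot cd to $ADMIN_DIR" >&2; return 1; }
    exec "./$prog" "$@"
}

# A real wrapper would end with:  start_program "$0" "$@"
```

Because the program name is derived from the link name, one wrapper can serve every admin program that has a link pointing to it.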
/usr/lib/noah/database.config
This is typically a link to /etc/noah/database.config in a Debian release.
The maketext_dir parameter in database.config tells NOAH and gentext where the mkstats.* scripts can be found.
mkstats Scripts:
In Linux, the maketext directory is generally in the
NOAH cgi-bin directory.
In a Windows environment, maketext is in the admin directory.
mkstats-pdf.bash
Script to convert a pdf file to text and generate a .stats file for content searching.
mkstats-sxw.bash
Script to convert an sxw (OpenOffice) file to text and generate a .stats file for content searching.
mkstats-xls.bash
Script to convert an xls (MS spreadsheet) file to text and generate a .stats file for content searching.
mkstats-ps.bash
Script to convert a ps file to text and generate a .stats file for content searching.
mkstats-doc.bash
Script to convert a doc (MS Word) file to text and generate a .stats file for content searching.
mkstats-sxc.bash
Script to convert an sxc (OpenOffice) file to text and generate a .stats file for content searching.
mkstats-txt.bash
Script to take a text file and generate a .stats file for content searching.
mkstats.bash
Skeleton script to use as a model when writing a new mkstats-? script for another file format.
mkstats-test.cgi
Script to test for 3rd party programs.
NOTE that almost all of these scripts use profile to read a text file and generate a .stats file. profile is found in the maketext directory along with the mkstats scripts.
Harold Blount - Nordicwind Inc. www.nordicwind.ca
http://noah.@nordicwind.ca
NOAH - Copyright (c) 2004-2012 Nordicwind Inc. All rights reserved. <http://www.nordicwind.ca>
This is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, version 3 or later <http://gnu.org/licenses/gpl.html>.
This software is distributed WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
noah-efetch noah-purge noah-gentext noah-help
Noah Document Management Server : http://noah.nordicwind.ca