GENTEXT

NAME
SYNOPSIS
DESCRIPTION
OPTIONS
ENVIRONMENT
FILES
AUTHOR
REPORTING BUGS
COPYRIGHT
SEE ALSO

NAME

gentext - generate text and stats files in NOAH database

SYNOPSIS

gentext [OPTIONS]

DESCRIPTION

gentext is an administration program that is called periodically to generate the necessary database files to support content searching of files in the NOAH database.

There may be a wrapper script in /usr/bin with a noah- prefix that calls this program in the correct run directory.

3rd Party Programs
NOAH relies on a number of 3rd party programs to properly index the NOAH database for content searching.

If you are not sure whether these 3rd party programs are installed, go to http://www.nordicwind.ca/noah/3rdparty for guidance.

MS Windows
In a new installation of NOAH on a Windows server, the email.bat file contains a line that calls gentext with the -new option. Every time the email batch program runs, it therefore also indexes any recent uploads and updates the content.sum summary files, which keeps content searches efficient.

If you don’t have the NOAH Email Notification option, make sure you still schedule email.bat to run every five minutes.

Since most searching is for documents that are more than 5 minutes old, a short delay on a Windows Server to index new documents should not be a concern.

Linux
On a Linux server, new uploads are indexed automatically at the time of the upload. However, you still need to schedule the gentext program with the -update option so that the content.sum files are kept up to date for more efficient content searching.
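The scheduling requirement above could be met with a cron entry. The following is a minimal sketch, not a shipped NOAH script: the helper name gentext_cron_line, the 10-minute default interval, and the wrapper path /usr/bin/noah-gentext are all assumptions to adjust for your installation.

```shell
#!/bin/bash
# Hypothetical helper: prints a crontab line that runs gentext -update.
# Arguments (both optional, both assumptions):
#   $1  interval in minutes (default 10)
#   $2  path to the gentext wrapper (default /usr/bin/noah-gentext)
gentext_cron_line() {
    local interval_min="${1:-10}"
    local wrapper="${2:-/usr/bin/noah-gentext}"
    # Standard five-field crontab schedule followed by the command.
    printf '*/%s * * * * %s -update\n' "$interval_min" "$wrapper"
}

# Example: print the line, then paste it into the crontab of the user
# that owns the NOAH database (edit with "crontab -e").
# gentext_cron_line 15
```

Pick an interval that matches how quickly new uploads need to appear in the summary files.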

Indexing an Existing Database
If you already have a NOAH database and you are adding the Content Search option, you will want to index the existing files with the -missing option.
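A one-time indexing run might look like the sketch below. The function name index_existing_database and the GENTEXT variable are hypothetical, the wrapper path is an assumption, and combining -missing with -v and -database is assumed to be allowed because the usage line shows them as independent options.

```shell
#!/bin/bash
# GENTEXT names the command to run. The default path is an assumption;
# point it at gentext in the admin directory if there is no wrapper.
GENTEXT="${GENTEXT:-/usr/bin/noah-gentext}"

# Hypothetical helper: index files that have no text/stats files yet.
#   $1  database name (optional, "default" when omitted)
index_existing_database() {
    local db="${1:-default}"
    # -missing skips files that are already indexed; -v prints progress.
    "$GENTEXT" -missing -v -database "$db"
}

# Example:
# index_existing_database            # the default database
# index_existing_database archive    # "archive" is a made-up database name
```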

Content Search Limitations
The Content Search option uses text converters to generate .stats files for each file in the database. If a file is not in one of the supported formats listed below, it is not indexed and no error is given.

Supported files for Content Searching:

Windows .doc and .xls (Word and Excel) files.

OpenOffice files (.sxw, .sxc)

Adobe .pdf (Portable Document Format) files.

Text files (see noah.config text_params parameter).

PostScript .ps files (words are not always extracted correctly, especially the first letter of capitalized words).
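To predict whether a given file would be indexed, the list above can be expressed as a simple extension check. This is only a sketch: the function name is hypothetical, matching on the lowercased extension is an assumption about how gentext recognizes formats, and the txt case stands in for whatever text_params actually allows.

```shell
#!/bin/bash
# Hypothetical helper: succeeds if the file name carries one of the
# extensions listed as supported for content searching.
is_supported_ext() {
    local name="$1" ext
    case "$name" in
        *.*) ;;            # has an extension, keep going
        *)   return 1 ;;   # no extension at all
    esac
    # Take the text after the last dot and lowercase it (assumption:
    # matching is case-insensitive).
    ext="$(printf '%s' "${name##*.}" | tr '[:upper:]' '[:lower:]')"
    case "$ext" in
        doc|xls|sxw|sxc|pdf|ps|txt) return 0 ;;
        *) return 1 ;;
    esac
}

# Example:
# is_supported_ext report.pdf && echo "would be indexed"
```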

Unsupported Formats
If a file is an unsupported format but it does have a filename extension, gentext will call the mkstats script with the extension as the 3rd parameter. This script does nothing unless a user customizes it to generate the .stats files for that file extension. Of course you will need a text extraction program for this to work!

See FILES section for the mkstats scripts.
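A customized mkstats script would typically branch on the extension it receives. The sketch below shows only that dispatch step. Only the third parameter is described in this document, so the first two are treated as opaque; the function name and the choice of unrtf and html2text as extractors are assumptions for illustration.

```shell
#!/bin/bash
# Hypothetical dispatch step for a customized mkstats script.
# $3 is the file extension, as passed by gentext. $1 and $2 are not
# documented here and are deliberately ignored.
# Prints the name of the extractor to use, or fails if none is configured.
mkstats_converter_for() {
    local ext="$3"
    case "$ext" in
        rtf)  echo "unrtf" ;;       # assumed extractor for RTF
        html) echo "html2text" ;;   # assumed extractor for HTML
        *)    return 1 ;;           # no handler: do nothing, like the stock script
    esac
}

# Example:
# if converter="$(mkstats_converter_for "$@")"; then
#     echo "would extract text with $converter"
# fi
```

Whatever extractor is chosen must be installed separately, and its output still has to be turned into a .stats file.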

If you want to suppress warnings about file formats that you will never want indexed, add the file’s extension to the ignore-ext parameter in the noah.config file.

OPTIONS

gentext [-force] [-new] [-missing] [-file filename] [-rebuild] [-update] [-v] [-h] [-help] [-database DATABASE]

-force
re-generate text/stats files for all files in database.

-new
generate text/stats files for recently uploaded files.

-missing
only generate text/stats on files that are missing text/stats.

-file filename
index only this file (full path name)

-rebuild
rebuild content search content.sum files from scratch in database (these are consolidations of stats files to speed up searching).

-update
incrementally update content search content.sum files that are out of date based on flag files in order to catch up with recent file uploads.

-database DATABASE
Name of the database to operate on. Defaults to default; use this option to select a different database.

-v
Verbose mode: print processing information.

-h
short usage help

-help
This detailed help document.
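Because -file expects a full path name, a small helper can make a relative path absolute before calling gentext. This is a sketch under assumptions: the function name and the GENTEXT variable are hypothetical, and the default wrapper path may differ on your system.

```shell
#!/bin/bash
# GENTEXT names the command to run; the default path is an assumption.
GENTEXT="${GENTEXT:-/usr/bin/noah-gentext}"

# Hypothetical helper: index a single file, given any path to it.
index_one_file() {
    local path="$1"
    case "$path" in
        /*) ;;                       # already a full path
        *)  path="$PWD/$path" ;;     # prefix the current directory
    esac
    "$GENTEXT" -file "$path"
}

# Example (the path is made up):
# index_one_file /srv/noah/files/report.pdf
```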

ENVIRONMENT

This program is found in the admin directory and expects to be run from there.

In a Debian distribution, /usr/bin contains a link with a noah- prefix that points to the wrapper program in the admin directory, typically /usr/lib/noah/admin.

FILES

/usr/bin/gentext

This is a link to the wrapper.

/usr/lib/noah/admin/wrapper

This is a generic bash shell wrapper to start the program in the admin directory.
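For readers curious how one generic wrapper can serve several noah- links, the sketch below shows one plausible approach: derive the program name from the name the wrapper was invoked under, then run it from the admin directory. This is not the actual wrapper shipped with NOAH; the function name and the commented-out launch lines are assumptions.

```shell
#!/bin/bash
# Hypothetical helper: map a link name such as /usr/bin/noah-gentext
# to the program name it should start (gentext).
program_from_link() {
    local link_name
    link_name="$(basename "$1")"
    # Strip a leading "noah-" if present; other names pass through unchanged.
    echo "${link_name#noah-}"
}

# A wrapper built on this idea might end with (assumed, not verified):
# cd /usr/lib/noah/admin || exit 1
# exec "./$(program_from_link "$0")" "$@"
```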

/usr/lib/noah/database.config

This is typically a link to /etc/noah/database.config in a Debian release.

The maketext_dir parameter in database.config tells NOAH and gentext where the mkstats.* scripts can be found.

mkstats Scripts
In Linux, the maketext directory is generally in the NOAH cgi-bin directory.

In a Windows environment, maketext is in the admin directory.

mkstats-pdf.bash

Script to convert a pdf file to text and generate a .stats file for content searching.

mkstats-sxw.bash

Script to convert an sxw (OpenOffice) file to text and generate a .stats file for content searching.

mkstats-xls.bash

Script to convert an xls (MS spreadsheet) file to text and generate a .stats file for content searching.

mkstats-ps.bash

Script to convert a ps file to text and generate a .stats file for content searching.

mkstats-doc.bash

Script to convert a doc (MS Word) file to text and generate a .stats file for content searching.

mkstats-sxc.bash

Script to convert an sxc (OpenOffice) file to text and generate a .stats file for content searching.

mkstats-txt.bash

Script to take a text file and generate a .stats file for content searching.

mkstats.bash

Skeleton script to use as a model when writing a new mkstats-? script for another file format.

mkstats-test.cgi

Script to test for 3rd party programs.

NOTE that almost all of these scripts use the profile program to read a text file and generate a .stats file. profile is found in the maketext directory along with the mkstats scripts.

AUTHOR

Harold Blount - Nordicwind Inc. www.nordicwind.ca

REPORTING BUGS

http://noah.@nordicwind.ca

COPYRIGHT

NOAH - Copyright (c) 2004-2012 Nordicwind Inc. All rights reserved. <http://www.nordicwind.ca>

This is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, version 3 or later <http://gnu.org/licenses/gpl.html>.

This software is distributed WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

SEE ALSO

noah-efetch noah-purge noah-gentext noah-help

Noah Document Management Server : http://noah.nordicwind.ca