FAQ for Net Research Server
The Questions
- Background
- What is Net Research Server?
- How and why was Net Research Server created?
- OK, so how does Net Research Server compare to other search engines?
- Where can I get Net Research Server?
- General Technical Questions
- "Why can't I ...? Why won't ... work?" What to do in case of problems
- What's the best hardware/operating system/... How do I get the most out of my Net Research Server?
- Why isn't there a binary for my platform?
- Starting Net Research Server
- Why does Net Research Server no longer start up?
- How can I fix a corrupt database?
- How can Net Research Server run on port 80?
- How can Net Research Server start on boot?
- How can I stop Net Research Server?
- Error Log Messages
- Where is the log file?
- How can I get more log information?
- What are the common errors found in the log?
- ODP Configuration Questions
- How do I specify when to import ODP?
- How do I import a subset of ODP?
- How do I exclude ODP categories during import?
- How do I import the Adult ODP category?
- Crawler Configuration Questions
- What is a collection?
- How do I crawl and index a subset of ODP?
- How do I start the crawler?
- Mail Configuration Questions
- Why does Net Research Server have a mail engine?
- How do I setup SMTP and POP3?
- How do I setup virtual mail domains?
- Webserver Configuration Questions
- How can the webserver run on a particular domain/IP/port?
- How many webserver threads should I use?
- Where are the web logs stored?
- How do I set the default page?
- Rendering Configuration Questions
- How are HTML pages rendered?
- Which XSLT engine is used?
- How do I modify an XSL template?
- How do I access a page as XML data?
- Database Configuration Questions
- How can I backup my databases?
- How can I rebuild my databases?
- Access Control Questions
- How can I control access to admin?
- How can I control access to my application?
- How do I change user login and signup?
- Template Configuration Questions
- What is a template?
- How can I add a search engine to my template?
- How can I add a news source to my template?
- How do I aggregate XML/RSS feeds?
The Answers
A. Background
- What is Net Research Server?
Net Research Server
- is a powerful, flexible, application server
- implements web crawling, indexing, and search
- imports the Open Directory Project or a custom directory
- is highly configurable
- comes in binary form and installs with no dependencies
- provides cross-platform support for Windows and Linux
- creates applications to provide web users with features including:
- News aggregation
- Allows you to easily aggregate any news source
found on pages on the web. News sources
are added by specifying the URL of the
page on which the news is found, and
specifying a parse rule for the news
item title, description, and any other
metadata.
- Search engine aggregation
- Allows you to create meta search pages that
provide search results from multiple
search engines found on the web. Search
engines are added by specifying the
URL on which the search engine form
is found, and then specifying a search
result parse rule. The parse rule lets
you specify which metadata to extract
from the page.
- Open Directory content
- Allows you to import the Open Directory Project's
RDF dump, which provides over 500,000
categories and over 5 million site
listings. You can customize which areas
of ODP to import, browse, search, crawl,
and index. Alternatively, use your own directory and combine it with extra metadata for a rich search experience.
- Applications
- Net Research Server can organize content
into applications. Applications are
built entirely within an HTML application
editor and provide a customizable navigation
interface to all the content pages in
the application. The pages are organized
into tabs, with a drop-down menu for
each tab listing all its pages.
- Personalization
- Users can sign up for an application, log in,
create news subscriptions by e-mail,
create custom news or search pages with
selected sources, monitor search results
from search engines, monitor web pages
for changes, organize content into folders,
and send/receive mail.
- Wiki
- Import a Wiki or create your own. Incorporate Wiki answers and wiki metadata into search results.
- Tagging
- Users can import their bookmarks and submit more. By tagging the bookmarks you can find them easily using tag lists, tag
clouds, or search.
- Customizability
- You can customize the entire application
server HTML by modifying XSL templates.
The application editor lets you build
custom applications. Templates let you
add your own content. All content is
available as XML.
- How and why was Net Research Server created?
Net Research Server was created to provide innovative
ways to bring together web content from various
sources into a rich and unified interface.
As the web grows, information overload becomes
more and more of a problem. NRS addresses
this with its application metaphor, which
lets you build a portal featuring all the
information in one place. NRS is written in C++
and is available for Linux and Windows.
- OK, so how does Net Research Server compare to other
search engines?
NRS is highly scalable. It crawls and indexes
over 10 million pages per index. Federated search lets you search across multiple indexes in the same amount of time.
NRS is unique in bringing together many search
technologies into one unified platform, and
providing a rich interface to create research
solutions and portals.
- Where can I get Net Research Server?
The latest version is always found on the download page. We
have free versions of NRS restricted to 10,000 documents. You can upgrade the document limit easily within NRS.
B. General Technical Questions
- "Why can't I ...? Why won't ... work?" What to do in case of problems
If you are having trouble with NRS, you should take the following steps:
- Check the log file. NRS generates a log file that you can find in
the application directory. You can also
access the log through the admin interface
on the webserver tab, by clicking on the
"system log" link.
- Check the latest version of this Frequently-Asked
Questions list, which is regularly updated.
LoopIP has 3 levels of support:
- free support: includes help by e-mail and phone in getting NRS
up and running, and with general configuration
and usage questions.
- hourly support: provided by first purchasing support hours within NRS. This kind of support offers hands-on help with configuration, installation, customization of templates, and solution development.
- support plan: if you have purchased an NRS
maintenance agreement, priority support
is provided for the length of agreement,
as well as priority patches and bug fixes.
All e-mail support inquiries should be made to support
- What's the best hardware/operating system/... How do I get the most out of my Net Research Server?
The Windows platform provides the following advantages:
- faster XSLT engine: if your Windows installation
has the Microsoft XML 4.0+ parser installed,
it is used over the internal Sablotron engine
and provides significantly faster page rendering
performance.
- threading architecture: NRS uses multiple threads
to deal with concurrent user requests. The
Windows architecture provides better performance
when large numbers of threads are in use.
This issue has been fixed under Linux in
2.5+ kernels.
The Linux platform provides the following advantages:
- no aggressive filesystem caching: Windows
server installations use filesystem caching
options by default that impact database
performance. These settings can be modified
in the Windows registry.
Generally you will not notice much difference between the two platforms.
NRS is disk intensive and benefits from a fast
striped RAID disk array. NRS operates with
as little as 512MB of RAM, but requires up
to 2GB of RAM when indexing over 10 million
full-text documents.
It is also possible under Linux to use raw disk
partitions and software RAID to further increase
performance.
Dual CPU machines provide better responsiveness
during update cycles, as NRS can crawl/index
in the background whilst still serving pages.
Also due to the multithreaded non-blocking
architecture of NRS, better scalability is
achieved with more CPUs.
NRS requires 7GB diskspace to import and keep
up-to-date the ODP directory, and requires
a further 12GB per million documents crawled
and indexed.
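As a rough worked example of the disk figures above, the following sketch adds the 7GB needed for ODP to 12GB per million crawled documents. The function name and the linear scaling between whole millions are assumptions made for illustration; this is not an NRS tool.

```python
# Disk figures quoted in this FAQ.
ODP_DISK_GB = 7            # import and keep the ODP directory up to date
DISK_GB_PER_MILLION = 12   # per million documents crawled and indexed


def estimate_disk_gb(documents: int, with_odp: bool = True) -> float:
    """Estimate disk space in GB (hypothetical helper, linear scaling assumed)."""
    crawl_gb = DISK_GB_PER_MILLION * documents / 1_000_000
    return crawl_gb + (ODP_DISK_GB if with_odp else 0)


if __name__ == "__main__":
    # 10 million documents plus ODP: 12 * 10 + 7 = 127 GB
    print(estimate_disk_gb(10_000_000))
```

Treat the result as a planning figure only; actual usage depends on the documents you crawl.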
Using flash solid-state drives (SSDs), it is
possible to increase search performance from a couple of searches per second to 50 or more searches per second.
- Why isn't there a binary for my platform?
Binaries are provided for the Windows platform, and for the i386 Linux 2.4+ platform. Contact us for additional platform support.
C. Starting Net Research Server
- Why does Net Research Server no longer start up?
By default the NRS webserver binds to "localhost"
or 127.0.0.1, and to the port number 2012. You must purchase an NRS upgrade, by registering and making a purchase within the NRS admin, to be able to change the address from localhost:2012. Check the log file to see if
an error occurred starting the webserver.
If so you can start NRS with the command-line
option: -address "host:port" where host is
an IP address or fully qualified domain name,
and port is the port number such as 2012 or
80. Under Linux only the root user can start
services on port numbers below 1024. So if
you want to start NRS on port 80 under Linux,
make sure you are the root user (you can use the su command to become root, or use the sudo tool), and that
no other service such as Apache is listening
on the same port.
It is also possible you have run out of disk space.
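Before restarting NRS, it can help to confirm that the address is actually free. The sketch below is a generic check written for this FAQ, not part of NRS: it tries to bind a TCP socket to a host and port and reports whether that worked. The function name is made up, and the check covers IPv4 only.

```python
import socket


def can_bind(host: str, port: int) -> bool:
    """Return True if a TCP socket can currently be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Address already in use, permission denied (ports below
            # 1024 without root under Linux), or an unknown host.
            return False
        return True


if __name__ == "__main__":
    # 2012 is the default NRS port mentioned above.
    print("127.0.0.1:2012 is free:", can_bind("127.0.0.1", 2012))
```

Run the check while NRS is stopped; if it reports False, another service is probably holding the port or you lack the privileges to use it.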
- How can I fix a corrupt database?
In the unlikely event of database corruption
it is possible to delete the databases in
question:
- deleting settings.db will require resetting all system
flags, but keeps everything else intact.
- deleting template.db will remove all templates, collection definitions, metadata definitions, cookies.
- deleting dir.db and dirindex.db will remove the directory.
- deleting form.db will delete search engine definitions
and require resetting some search engine elements on templates.
- deleting web.db and webindex.db will delete full-text
documents and search indexes. Also alert pages will be deleted.
- deleting mail.db will delete user mail accounts and mailbox content.
- deleting agent.db will delete all user alerts.
- deleting tag.db will delete all user bookmarks.
- deleting wiki.db and wikiindex.db will delete all wiki content.
- How can Net Research Server run on port 80?
Make sure you have no other service running on
port 80 such as IIS or Apache. Under Linux,
you need to be a root user to start a service
with a port number under 1024. Most webservers
will also by default bind to all IPs on port
80, causing a conflict. You can configure IIS
or Apache to use only particular IPs and
let NRS use the others.
For Apache, you need to configure Apache to only listen on the IP addresses and ports it needs by using the "Listen" configuration option in the httpd.conf file. For example: Listen mywebsite.com:80
For Microsoft Internet Information Server (IIS), if you have IIS 5.0 read the knowledge base entry that explains how to disable socket pooling. If you have IIS 6.0, read the knowledge base entry that explains how to list the IP addresses that IIS will use.
- How can Net Research Server start on boot?
Under Linux, you can add a line to /etc/rc.d/rc.local:
For example: /home/nrs/nrsd -address "127.0.0.1:80" &
Under Windows, NRS is installed by default as a
service and will thus start automatically.
Use the Service Manager to stop and start
the service. You can remove NRS as a service
with the command-line: -removeservice. You
can re-install NRS as a service with: -installservice.
- How can I stop Net Research Server?
There are many ways to stop NRS:
- In admin, click the "Quit Server" button.
- Under Linux, find out the name of the executable with:
ps -A. Then kill the process using: killall -QUIT processname
- Under Windows, stop the service using the service
manager. Or, if running in console mode, press Ctrl-C.
D. Error Log Messages
- Where is the log file?
The log file is found in the application directory.
It is also accessible from the admin interface
under the Webserver tab by clicking on the
"system log" link.
- How can I get more log information?
You can start NRS with the command-line option -verbose,
or turn on verbose logging in the admin webserver
tab. The log will now include extra information
such as each URL request received, each URL
crawled, and each URL indexed. This information
can be valuable to our support team when determining
support issues.
If you have crawl/indexing problems you can determine
exactly what is happening with this option.
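A verbose log grows quickly, so a small filter can help pull out the serious entries, which this FAQ describes as lines beginning with "Err". The sketch below is a generic helper, not an NRS utility; the function name and the sample lines are invented for illustration, and real log text may look different.

```python
from typing import Iterable, List


def serious_errors(lines: Iterable[str]) -> List[str]:
    """Return the log lines that begin with "Err", without line endings."""
    return [line.rstrip("\r\n") for line in lines if line.startswith("Err")]


if __name__ == "__main__":
    # Sample lines are made up; they only show the shape of the filter.
    sample = [
        "Info: webserver started\n",
        "Err: bad address\n",
        "Err: directory update failed\n",
    ]
    for entry in serious_errors(sample):
        print(entry)
```

To scan a real log, pass an open file object for the log file in your application directory instead of the sample list.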
- What are the common errors found in the log?
Look in the log for any line beginning with "Err"
for serious errors. These can include:
- no licence found: make sure your license
file is in the application directory.
- invalid license: the license file is not valid. Contact support
to fix your license.
- web index update failed: Serious errors
occurred during full-text indexing.
- directory update failed: NRS was unable to download
the RDF dump, or the RDF dump was invalid.
By default, NRS downloads the RDF dumps
from http://rdf.dmoz.org/rdf/. Make sure
the following files exist:
http://rdf.dmoz.org/rdf/structure.rdf.u8.gz
http://rdf.dmoz.org/rdf/content.rdf.u8.gz
If these files are missing you can change
the dmoz rdf path in admin, or using the
command-line option: -dmozurl "http://xxx.com/yyy/zzz/"
- bad address: NRS could not start the web
server with your given address. Change it
using the command-line option: -address
"hostname:port"
E. ODP Configuration Questions
- How do I specify when to import ODP?
You can specify the date of the next update in
this format: mm/dd/yyyy. You can then also
specify the hour at which to update, and the
number of days until the next update.
ODP updates happen in the background, and once
ODP has been imported a database swap is made.
This database swap results in a few seconds
of downtime.
To disable ODP updates, in the "update date"
field, specify "disable".
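To see how the three schedule fields combine, the sketch below lists the update times they imply. It assumes the date is given as mm/dd/yyyy, the hour as 0 to 23, and that each update follows the previous one by the given number of days. The function name is hypothetical, and how NRS evaluates the schedule internally is not described here.

```python
from datetime import datetime, timedelta
from typing import List


def upcoming_updates(update_date: str, hour: int, interval_days: int,
                     count: int = 3) -> List[datetime]:
    """List the next `count` update times; empty if updates are disabled."""
    if update_date.strip().lower() == "disable":
        return []
    first = datetime.strptime(update_date, "%m/%d/%Y").replace(hour=hour)
    return [first + timedelta(days=interval_days * i) for i in range(count)]


if __name__ == "__main__":
    # Weekly updates at 03:00, starting 31 January 2024.
    for when in upcoming_updates("01/31/2024", 3, 7):
        print(when.strftime("%m/%d/%Y %H:%M"))
```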
- How do I import a subset of ODP?
You can specify a new default subset by using
the command-line option: -dmozroot "newrootlist"
or using the admin interface. You can specify
one or more category IDs that will form the
new root of the default directory template.
The next ODP import will then only import
this set of content. You can see the new content
subset on the directory template, or any new
template of type "directory" you create.
- How do I exclude ODP categories during import?
You can specify a directory filter to exclude
particular categories. It is done by name,
for example:
Regional/North America/United States
would be specified in the category filter.
If you wanted to remove all categories except "Computers" you would specify:
Adult,Arts,Business,Games,Health,Home,Kids and Teens,News,Recreation,Reference,Regional,Science,Shopping,Society,Sports,World
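Typing a long exclusion list by hand is error-prone, so the filter can be generated instead. The sketch below builds the comma-separated string from the top-level category names used in the example above plus "Computers". The helper name is made up for illustration, and the category list should be checked against the ODP data you actually import.

```python
# Top-level ODP categories: the names from the example above plus "Computers".
TOP_LEVEL = [
    "Adult", "Arts", "Business", "Computers", "Games", "Health", "Home",
    "Kids and Teens", "News", "Recreation", "Reference", "Regional",
    "Science", "Shopping", "Society", "Sports", "World",
]


def exclusion_filter(keep: set) -> str:
    """Build a comma-separated filter that excludes every category not in `keep`."""
    return ",".join(name for name in TOP_LEVEL if name not in keep)


if __name__ == "__main__":
    # Reproduces the filter shown above for keeping only "Computers".
    print(exclusion_filter({"Computers"}))
```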
- How do I import the Adult ODP category?
By default the adult section of the ODP is not
imported. To enable, specify the command-line
option: -pornfilter 0
or modify the porn filter setting in admin.
F. Crawler Configuration Questions
- What is a collection?
A collection is a way to crawl and index ODP
categories, a website, or a user library.
In a collection you specify either the URL
or the list of ODP category IDs to crawl
and index. You can specify how the crawler
should behave with settings to control how
much of a site to crawl, URL patterns to exclude
or include, stay on path, stay on site, robot
behavior, politeness, number of pages to crawl
per site, depth of crawl, and more. You can
also specify indexing settings such as rank
boost, popularity ranking through page link
analysis.
Collection definitions can be imported and exported from
the admin interface.
To test a collection, it must first be crawled,
then indexed. You can then create a template
for it, to search against it.
- How do I crawl and index a subset of ODP?
Create a collection of an arbitrary name using the
new collection form found under "Crawler|Collection
List". Then select the collection to edit
it. Under "Categories to Index" deselect all,
and specify a category id using the 'category
picker'. Review the crawling and indexing
settings, then click "Save". Click on the
crawler tab, and click the "Start" button
to start the crawl. Once the crawl is finished
(you can also stop the crawler at any point),
click "Start reindex full text". You can monitor
the progress by clicking on the crawler tab.
Once finished, to test your collection you
need to create a corresponding template. Click
on "Crawler|Collection List|Create Template"
to automatically create one for your collection.
Then select your new template under the templates
tab, and do a search.
- How do I start the crawler?
The crawler can be manually started by selecting
the "Crawler|Start crawl" button. A schedule
can also be set up where you specify the next
crawl date, how long to crawl for, and how
many days to wait for the next crawl. Specifying
a date value of "disable" will disable the
crawl schedule.
G. Mail Configuration Questions
- Why does Net Research Server have a mail engine?
Giving users a mail account lets them organize
all their mail in folders. NRS generates
mail for alerts placed on search results,
page changes, and new subscriptions. NRS can also
discover newsletters in the full-text
ODP index that can be signed up
for automatically. NRS also sends out mail
for alerts sent to external mail accounts.
- How do I setup SMTP and POP3?
Under the mail tab, you can specify the SMTP and
POP3 addresses in the form of "hostname:port".
Typically you use port 25 for SMTP and 110
for POP3. The SMTP address must correspond
to a valid MX record in your DNS records,
and also have a valid reverse DNS record,
for mail to be properly sent out to other
mail gateways.
- How do I setup virtual mail domains?
When you add a mail template to an application,
you can edit the properties of the mail template
and specify a new domain. All users who
have signed up with this application will
receive e-mail addresses with the given domain.
H. Webserver Configuration Questions
- How can the webserver run on a particular domain/IP/port?
You can use the command-line option: -address
"hostname:port"
Or specify the address in the admin interface
under "Webserver|Address".
The "hostname" can be any domain, hostname, or IP address that
can be bound on the particular machine.
The port can be any value from 1 to 65535. Under
Linux, using a port number under 1024 requires
root privileges. Also make sure no other
service on the machine has bound the same
address. For example, both Apache and IIS bind
by default to all IPs on port 80.
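The rules above for a "hostname:port" value can be checked before passing it to -address. The following sketch is a standalone validator written for this FAQ, not NRS code; the function names are hypothetical, and it does not handle bracketed IPv6 addresses.

```python
from typing import Tuple


def parse_address(address: str) -> Tuple[str, int]:
    """Split "hostname:port" and check the port range given in this FAQ."""
    host, sep, port_text = address.rpartition(":")
    if not sep or not host:
        raise ValueError(f"expected hostname:port, got {address!r}")
    port = int(port_text)  # raises ValueError if the port is not a number
    if not 1 <= port <= 65535:
        raise ValueError(f"port {port} is outside 1-65535")
    return host, port


def needs_root_on_linux(port: int) -> bool:
    """Ports under 1024 need root privileges under Linux."""
    return port < 1024


if __name__ == "__main__":
    host, port = parse_address("127.0.0.1:80")
    print(host, port, "needs root on Linux:", needs_root_on_linux(port))
```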
- How many webserver threads should I use?
Each concurrent user request requires a separate
thread. When a user requests a template page
it can sometimes take 30 or so seconds to
fully load, as determined by the gather timeout
value. For example when conducting a metasearch,
as the search results return from the search
engines they are streamed back to the user.
If retrieving search results from a search
engine takes more than the default timeout
of 30 seconds, the page will take 30 seconds
to load. So for example 100 threads would
allow 100 users to simultaneously request
a page. All other users are placed on a queue.
The listen queue size can be modified with
the "socket listen backlog" setting, and the
thread count can be modified with the "threads"
setting, both found on the web server tab.
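For sizing the thread count, a back-of-envelope estimate can be useful. The sketch below assumes the worst case described above, where every request holds its thread for the full 30-second gather timeout. It is simple arithmetic, not a measurement of NRS, and the function names are made up.

```python
import math


def worst_case_pages_per_second(threads: int, gather_timeout_s: float = 30.0) -> float:
    """Throughput if every request holds its thread for the full timeout."""
    return threads / gather_timeout_s


def threads_needed(pages_per_second: float, gather_timeout_s: float = 30.0) -> int:
    """Threads required to sustain a request rate in the same worst case."""
    return math.ceil(pages_per_second * gather_timeout_s)


if __name__ == "__main__":
    print(round(worst_case_pages_per_second(100), 2))  # 100 / 30
    print(threads_needed(5))                           # 5 * 30
```

Most pages finish well before the timeout, so real throughput should be higher than this worst-case figure.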
- Where are the web logs stored?
By default log files for all user requests are
placed in the application directory. The log
files are of the format:
DATE TIME TIMETAKEN IP METHOD PAGE QUERY STATUS
USERAGENT REFERER COOKIE
To better organize your logs, create a new
directory and specify it under "Webserver|Web
log directory".
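The field list above can be turned into a small parser for log analysis. The sketch below assumes that fields are separated by a tab character, which this FAQ does not state, so check a real log line and adjust the delimiter before relying on it. The function name and the sample values are invented for illustration.

```python
from typing import Dict

# Field names in the order given in this FAQ.
FIELDS = ["DATE", "TIME", "TIMETAKEN", "IP", "METHOD", "PAGE",
          "QUERY", "STATUS", "USERAGENT", "REFERER", "COOKIE"]


def parse_log_line(line: str, delimiter: str = "\t") -> Dict[str, str]:
    """Map one web log line onto the documented field names.

    The delimiter is an assumption; adjust it to match your log files.
    """
    values = line.rstrip("\r\n").split(delimiter)
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))


if __name__ == "__main__":
    # A made-up line; real values and formats may differ.
    sample = "\t".join(["2024-01-31", "03:00:00", "12", "127.0.0.1", "GET",
                        "/directory", "p=1", "200", "ExampleAgent/1.0", "-", "-"])
    print(parse_log_line(sample)["STATUS"])
```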
- How do I set the default page?
If no URL path is specified to NRS, NRS will
return the admin page. To modify this setting,
modify "Templates|Default Template Settings|Default
template" and specify a new default page or
path.
I. Rendering Configuration Questions
- How are HTML pages rendered?
HTML pages are rendered by performing an XSLT transform
on an XML stream and an XSL template. Each
template type has an associated XSL template
which can be overridden. In the case of streaming
templates such as templates of type "search"
an XSLT transform occurs multiple times for
the page, first to render the header of the
page, next to render the middle of the page
one or more times, and last to render the
footer of the page.
- Which XSLT engine is used?
NRS incorporates the Sablotron XSLT engine from
Ginger Alliance. Under Windows, if
the Microsoft XML 4.0+ parser is installed
it is automatically detected and used. The
Microsoft engine is generally faster and preferable.
- How do I modify an XSL template?
There are default XSL templates for each type of
template. You can access the default XSL templates
by selecting the template from the template
list and appending ".xsl" to it. Using your
favorite editor, you can modify the XSL file
and upload it under "Templates|Default template
settings" to modify default templates.
Each template can also have its own XSL template.
You can use the default XSL file, modify it,
and upload it under the template's property
page, or using the "group set" interface to
modify multiple templates at the same time.
The XSL file can either contain an entirely
new XSL file, or just the <xsl:template>
blocks you wish to override over the default
XSL templates.
- How do I access a page as XML data?
Any page can be retrieved as XML by appending
".xml" to the URL (before the query or ?).
Using Internet Explorer 5.0+, you can view
the XML and its structure. The major sections
are:
- /results/query:
reflects the URL query variables.
- /results/sysinfo:
reflects NRS system settings and licensing
- /results/user:
reflects some user account information
- /results/title:
reflects a page title
- /results/attribution:
reflects ODP attribution information
- /results/cat:
reflects an ODP category given by the "p"
query
- /results/dirsearchresults:
reflects a directory search
- /results/searches:
reflects meta search results
- /results/snippets:
reflects news results
- /results/xml:
reflects XML/RSS aggregation
- and more
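When fetching pages as XML from a script, the ".xml" suffix has to go on the path, before the query string. The sketch below shows one way to rewrite a URL with the standard library. The function name, host, port, and template path in the example are placeholders, and the URL is expected to contain a template path.

```python
from urllib.parse import urlsplit, urlunsplit


def as_xml_url(url: str) -> str:
    """Append ".xml" to the URL path, leaving the query string untouched."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(path=parts.path + ".xml"))


if __name__ == "__main__":
    # Placeholder URL; "p" is the category query variable mentioned above.
    print(as_xml_url("http://localhost:2012/directory?p=123"))
```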
J. Database Configuration Questions
- How can I backup my databases?
When NRS is not running, you can back up all databases
manually. When it is running, you can use the
admin interface to back up all databases to
a directory. During the backup process, databases
are compacted, improving their performance.
You can also specify a daily backup. Specify a
directory, under which NRS will create a directory
for every day of the week.
- How can I rebuild my databases?
If you need to rebuild your NRS installation,
it is possible for example to export all templates
to a file, delete all databases, and then
reimport your templates. Alternatively don't
delete settings.db to keep all system configuration,
don't delete template.db to keep your templates,
don't delete agent.db to keep your user alerts,
don't delete mail.db to keep your mail accounts
and mailboxes, don't delete dir.db and dirindex.db
to keep your ODP catalog, don't delete form.db
to keep search form definitions, and lastly
don't delete web.db and webindex.db to keep
your full-text index.
Generally any of these databases can be safely deleted
without compromising your NRS installation.
The only database interdependencies are between
dir.db and dirindex.db, and web.db and webindex.db,
so these 2 pairs of databases must always
be manipulated together.
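If you script database cleanup or restore steps, the pairing rule above is easy to enforce in code. The sketch below expands any selection of database files so that neither pair is split. It only computes file names and does not touch the disk; the helper name is hypothetical.

```python
# Interdependent pairs named in this FAQ; each pair must be handled together.
PAIRS = [("dir.db", "dirindex.db"), ("web.db", "webindex.db")]


def with_partners(selected: set) -> set:
    """Expand a selection of database files so that no pair is split."""
    result = set(selected)
    for first, second in PAIRS:
        if first in result or second in result:
            result.update((first, second))
    return result


if __name__ == "__main__":
    # Selecting web.db pulls in webindex.db; mail.db stands alone.
    print(sorted(with_partners({"web.db", "mail.db"})))
```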
K. Access Control Questions
- How can I control access to admin?
Use the command-line option: -adminpeer "accesslist"
Or specify the access list under "Webserver|Admin
peer ip".
The access list consists of a comma-separated
list of IP addresses or usernames.
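To review an access list before applying it, the entries can be split into IP addresses and usernames. The sketch below treats any entry that parses as an IP address as an address and everything else as a username. That rule is an assumption made for illustration, since this FAQ does not say how NRS distinguishes the two, and the function name is made up.

```python
import ipaddress
from typing import List, Tuple


def split_access_list(access_list: str) -> Tuple[List[str], List[str]]:
    """Separate a comma-separated access list into IP addresses and usernames."""
    ips: List[str] = []
    users: List[str] = []
    for entry in (item.strip() for item in access_list.split(",")):
        if not entry:
            continue  # skip empty items such as a trailing comma
        try:
            ipaddress.ip_address(entry)
            ips.append(entry)
        except ValueError:
            users.append(entry)
    return ips, users


if __name__ == "__main__":
    # Example values are placeholders.
    print(split_access_list("127.0.0.1, 192.168.1.10, alice"))
```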
- How can I control access to my application?
Templates in an application can be marked as secure.
If the user is not known, the user is redirected
to a template (usually mail or library) in
the application that has a login/signup option.
An application can also have a template of
type "application" that allows editing the
application. Access to this template is controlled
in the property page of the template.
- How do I change user login and signup?
Templates of type "mail" or "Library" feature login
and signup functionality. Any application
containing one of these templates on its tab
bar will redirect the user to this template
if the page they are accessing is marked as
secure. To modify the login and signup pages,
you need to edit the corresponding XSL template
and look for the "drawsignup" and "drawllogin"
sections.
L. Template Configuration Questions
- What is a template?
A template is a page that can be accessed through
the NRS webserver. A template can be of many
types, such as "news", "search", "directory",
"application", and "mail". Each template type
offers a different kind of functionality. Templates
are commonly organized as applications by
providing an app name, and group name to their
settings. A template is rendered into HTML
by generating an XML stream for the page and
query parameters and performing an XSLT transform
on it and its corresponding XSL template.
A template can be accessed as XML, HTML, or XSL
through the NRS webserver.
- How can I add a search engine to my template?
Create a template of type "search", edit it, and
specify the URL of a search engine, for example
http://www.google.com, where it says "add
new search". Click "Save". You can now edit
the new search, and specify its name, description,
and optionally a parse rule. A parse rule
is used to help NRS retrieve the search results
off the page. The parse rule also enables
extraction of metadata associated with search
results, such as date, size, cached link, etc.
- How can I add a news source to my template?
Create a template of type "news", edit it, and specify
the URL of a news page on the web where it
says "add new snippet/news". Click "save",
then edit your news item, and specify its
parameters. You have to specify a parse rule,
the simplest of which might be: <title link><desc>.
If this does not work,
you need to read up on the parse rule syntax
in the admin help and refer to the demo news
template examples.
- How do I aggregate XML/RSS feeds?
NRS offers powerful aggregation and caching of
XML/RSS feeds. You specify all the URLs to
the feeds in a template, and for each you
can also define an XSLT identity transform
to optionally standardize the XML of each
source into the same vocabulary.