Chapter 11 – Search Server 2010 and FAST Search: Architecture and Administration

Notes from the MS SharePoint 2010 Administrator's Companion eBook (p. 525)

Enterprise Search (the version that ships with SharePoint 2010):
1. Ensures that enterprise data from multiple systems can be indexed – including SharePoint sites, files in file shares, web pages on other web sites, third-party repositories, and line-of-business (LOB) systems such as CRM or ERP solutions.
2. Content from multiple enterprise repositories can be queried both (1) independently and (2) from within the context of your business app.
3. Provides ranking of search results.

SharePoint 2010 provides a Connector Framework that enables the crawler to index files, metadata, and other types of data.

5 search products —
(1) MS SharePoint Foundation 2010
(2) Search Server 2010 Express
(3) Search Server 2010
(4) SharePoint Server 2010
(5) FAST Search Server 2010 for SharePoint

SharePoint Foundation 2010 – 10 million indexed items.
SharePoint Server 2010 – 100 million indexed items.
FAST Search Server 2010 for SharePoint – 1 billion indexed items.

Search Tools —

(1) Language detection – supports 53 languages.
(2) Word breakers – or tokenizers. Separate words at spaces, punctuation, and special characters.
(3) Custom Dictionaries – a custom dictionary file defines words or terms that the word breaker should treat as complete words and not break apart. For example, AT&T. The file can be edited in a text editor but must be saved in Unicode format as CustomNNNN.lex, where NNNN is the language hex code (for example, Custom0409.lex for English, United States). Each line ends with CR + LF.
Custom dictionary file rules (see the sketch after this list):
* Entries are case-insensitive.
* The "|" character is not permitted.
* No spaces are permitted.
* "#" cannot be the first character.
* All other characters are valid.
* Max length of an entry is 128 characters.
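
A minimal sketch of writing such a file from PowerShell, assuming English, United States (0409) and the AT&T example from above:

# Write a custom dictionary entry to Custom0409.lex in Unicode (UTF-16 LE);
# Set-Content on Windows terminates each line with CR + LF as required
Set-Content -Path "Custom0409.lex" -Value "AT&T" -Encoding Unicode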

(4) IFilter – tells the crawler how to crack open a file and how to identify and index its contents. For example, if you need to index PDF files, go to Adobe and download their IFilter for PDF. It ships as an installable executable.
(5) Connector – improvements —
* Attachments can now be crawled.
* Item-level security descriptors can be retrieved for external data exposed by the BDC.
* When crawling a BDC entity, additional related entities can be crawled, maintaining the entity relationship.
* Time stamp-based incremental crawls.
* Change log crawls that can remove items deleted since the last crawl are supported.

SharePoint 2010 indexing connectors —
* SharePoint Content – the crawler accesses data via (1) a Web Service and (2) Windows Authentication. For incremental crawls, it uses the change log.
* File Shares – accesses file shares via Windows Authentication.
* Web sites – uses link traversal as the crawl method but does not provide a security trimmer.
* People profiles – crawled via the profile pages of the My Site Host. Only the info that's exposed to Everyone is crawled.
* Lotus Notes – Connector Framework.
* Exchange public folders – Connector Framework.
* External systems – build a custom connector using the Connector Framework in SharePoint Designer 2010.

Search components and processes —
Content Source – the target servers that hold the content that needs to be indexed.

Connectors can use different protocols, such as FTP, HTTP or RPC.

Two parts to gathering information from a Content Source (important for understanding how crawling works):
Part 1: enumeration of the content items that should be crawled. Connectors connect to the content source, walk the URLs, and save the URL of each content item in the MSSCrawlQueue table.
The basic search engine is MSSearch.exe. When a crawl starts, a new process is spawned: Mssdmn.exe. The content source will appear as Starting.

Part 2: once a batch is enumerated, another mssdmn.exe is spawned; it connects to the URLs in the batch, opening each URL and downloading (1) the metadata and (2) the document contents. The content source will appear as Crawling.
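
You can watch these states from PowerShell; a minimal sketch, assuming a single Search Service Application in the farm:

# Show each content source and its current crawl status
$ssa = Get-SPEnterpriseSearchServiceApplication
Get-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa |
    Select-Object Name, CrawlStatus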

The number of crawl components across multiple servers can be increased as the workload increases, with automatic distribution of addresses; specific crawl addresses can be assigned to a specific crawl component through Host Distribution Rules.

Crawl partitioning speeds up crawling because each crawl component carries a reduced workload.

Index partitions (new in SharePoint 2010) – no single query component searches the entire index; the workload is spread across multiple servers. Partitioning is based on the DocumentID assigned to each document.

Query Federation – the formatting of queries in an OpenSearch definition so that they may be processed by an OpenSearch query component. Essentially, federated queries go to multiple query components that respond individually, and the results are compiled and presented in a Web Part.

Since no single query component holds the complete index, the Query Processor service manages dispatching the queries and merging the multiple result lists returned, using a round-robin load-balancing method (served by the Search Query and Site Settings service).

At least one Search Query and Site Settings service instance must be running to serve queries. The service should be started on every server that hosts a query component. It can be started on the Services on Server page, or with this PowerShell cmdlet:
Start-SPEnterpriseSearchQueryAndSiteSettingsServiceInstance
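
For example, a minimal sketch to run on each query server; the -Local switch targets the server you are logged on to:

# Start the Search Query and Site Settings service on the local server
$instance = Get-SPEnterpriseSearchQueryAndSiteSettingsServiceInstance -Local
Start-SPEnterpriseSearchQueryAndSiteSettingsServiceInstance -Identity $instance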

Each query component responds to queries and sends the results from its index partition back to the query processor from which it received the query. The query component is also responsible for word breaking, noise-word removal, and stemming for the search terms provided by the query processor.

A Host Distribution Rule can assign a specific host to a crawl database.

Farm and Application Configuration – (UI management tool)
Farm-wide search settings — Central Admin → General Application Settings → Search → Farm Search Administration

Crawler Impact Rules – Control the rate (or speed) at which the crawler indexes a content source.
Defining the sites for crawler impact:

You can use any of the following:
site name: www.contoso.com
all inclusive: *
domain: *.contoso.com
Machine: WFE01

You can change (1) the number of simultaneous requests at a time, or (2) request one document at a time with an interval of X seconds.

Manually Creating a Search Service Application — (same process as creating any new service application)
Central Admin → Application Management → Manage Service Applications → New → Search Service Application
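
The PowerShell equivalent, as a minimal sketch; the names, application pool, and database name are illustrative assumptions, and the crawl/query topology still has to be configured afterward:

# Create a Search Service Application and its proxy
$pool = Get-SPServiceApplicationPool "SharePoint Web Services Default"
$ssa  = New-SPEnterpriseSearchServiceApplication -Name "Search Service Application" `
    -ApplicationPool $pool -DatabaseName "Search_AdminDB"
New-SPEnterpriseSearchServiceApplicationProxy -Name "Search Service Application Proxy" `
    -SearchApplication $ssa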

The first step is to create a content source: click the Content Sources link (Quick Launch) to open the Manage Content Sources page. When creating a content source, you select a content source type:

sps3://mysites is a special crawl of the user profiles using the profile pages of the My Site Host.

A start address cannot appear in more than one content source but does not have to start with the root of an application. A single content source can contain up to 50 start addresses, and a search service application can have up to 50 content sources.
One service application = 50 content sources; one content source = 50 start addresses.
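
Content sources can also be scripted; a minimal sketch, where the name, type, and start address are illustrative assumptions:

# Create a SharePoint-type content source with one start address, then crawl it
$ssa = Get-SPEnterpriseSearchServiceApplication
$cs  = New-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa `
    -Type SharePoint -Name "Intranet" -StartAddresses "http://intranet"
$cs.StartFullCrawl()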

*** If you start an incremental crawl on a content source that has never been crawled before, a full crawl will start instead.

Creating and Managing a Crawl Rule – (a Crawl Rule is not a Crawler Impact Rule)

Crawl Rule – configures include/exclude rules and specific security contexts for crawling that differ from the default content access account. Rules are relative to the target URL, not the content source.
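
A minimal sketch, assuming a hypothetical drafts path that should be excluded:

# Exclude everything under /drafts/ from the crawl
$ssa = Get-SPEnterpriseSearchServiceApplication
New-SPEnterpriseSearchCrawlRule -SearchApplication $ssa `
    -Path "http://intranet/drafts/*" -Type ExclusionRule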

Using Server Name Mappings — replace the path shown in search results with another.

Host Distribution Rule – a specific address can be assigned to a specific crawl database using a Host Distribution Rule. Works only if more than one crawl database exists.
Simply (1) enter the host name, and (2) select a crawl database.

Managing File Types – "What types of files should be crawled?" *** The crawl component will only request file types from content sources that appear on this page.
If you need to crawl a new file type, such as PDF, be sure to (see the sketch after this list):
1. Install the correct IFilter.
2. Add the new file type (file extension).
3. Install the graphic image (icon) for the file type (available from the file type's manufacturer).
4. Update Docicon.xml.
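
Step 2 can be scripted; a minimal sketch, assuming the IFilter from step 1 is already installed:

# Register the pdf extension with the Search Service Application
$ssa = Get-SPEnterpriseSearchServiceApplication
New-SPEnterpriseSearchCrawlExtension -SearchApplication $ssa -Name "pdf"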

Resetting the Index – erases all indexes from all query components, and no search results are available until a full crawl is completed!
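
From PowerShell, a hedged sketch: the Reset method and its two Boolean arguments (disable search alerts during reset, ignore unreachable servers) are an assumption about the 2010 object model that mirrors the check boxes in the UI:

# Reset the search index (no results until a full crawl completes!)
$ssa = Get-SPEnterpriseSearchServiceApplication
$ssa.Reset($true, $false)   # disableAlerts, ignoreUnreachableServer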

Managing Crawls – (p. 550)
During a full crawl, the old index is replaced only when the full crawl is completed, which usually takes hours. So there will be a time when two full sets of indexes exist on the same hard drive at the same time – plan for disk space!

Incremental crawl –
for file systems and normal web crawls, the date/time stamp is compared with the crawl log history.
for SharePoint sites, the change logs in the database are used.

Global Crawl Management – on the Content Sources page, click Start All Crawls; a scripted equivalent is sketched below:
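
A minimal sketch, assuming full crawls are wanted on every source:

# Start a full crawl on every content source
$ssa = Get-SPEnterpriseSearchServiceApplication
Get-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa |
    ForEach-Object { $_.StartFullCrawl() }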

Although the crawl process is read-only and does not modify the files, it will change the last read date on some files,
which can impact access auditing.

Content Source Crawl Management —
Each content source (on the Content Sources page) has a context menu where you can start, stop, or resume a full or incremental crawl.


You can configure whether a list or library is subject to crawls:
go to your list/library, List Settings → Advanced Settings.

Diagnostic logging – Central Admin → Monitoring → Reporting → Configure Diagnostic Logging

On the same screen, you can enable Event Log Flood Protection to suppress the logging of the same event repeatedly until the condition returns to normal.
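
The same setting can be flipped farm-wide from PowerShell; a minimal sketch:

# Enable Event Log Flood Protection
Set-SPDiagnosticConfig -EventLogFloodProtectionEnabled:$true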

Managing the Search Service Topology – the topology can be changed via Modify Topology – but not in a stand-alone installation!

Crawl database – contains the configurations and instructions required by the crawl component, the tables that queue items to be crawled, and the crawl logs. Since a new crawl component must be associated with an existing or pending database, you need to create the database first (a sketch follows).
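
A minimal sketch of that order of operations; the database name is an illustrative assumption, and in production you would typically clone the active crawl topology rather than build a bare new one:

# 1. Create the crawl database first
$ssa = Get-SPEnterpriseSearchServiceApplication
$db  = New-SPEnterpriseSearchCrawlDatabase -SearchApplication $ssa -DatabaseName "SearchCrawlDB2"
# 2. Then bind a crawl component to it in a new crawl topology and activate
$ct  = New-SPEnterpriseSearchCrawlTopology -SearchApplication $ssa
$ssi = Get-SPEnterpriseSearchServiceInstance -Local
New-SPEnterpriseSearchCrawlComponent -SearchApplication $ssa -CrawlTopology $ct `
    -CrawlDatabase $db -SearchServiceInstance $ssi
Set-SPEnterpriseSearchCrawlTopology -Identity $ct -Active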

Crawl Components – "Auto Host Distribution" – SharePoint will make recommendations for redistribution.

Property databases – new property databases are created to improve query performance.

Index partitions and query components – you can split the index into smaller partitions to speed up full-text queries. Each partition can contain 10 million items. To add one (see the sketch after this list):
(1) select the farm member to host the index partition and query component
(2) select the existing property database with which this index will be associated
(3) specify the location for the index files
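
A minimal sketch using the 2010 query-topology cmdlets; the partition count and the local-server choice are illustrative assumptions, and in production you would typically clone the active topology before changing it:

# Build a query topology with two index partitions, one query component each
$ssa = Get-SPEnterpriseSearchServiceApplication
$qt  = New-SPEnterpriseSearchQueryTopology -SearchApplication $ssa -Partitions 2
$ssi = Get-SPEnterpriseSearchServiceInstance -Local
Get-SPEnterpriseSearchIndexPartition -QueryTopology $qt | ForEach-Object {
    New-SPEnterpriseSearchQueryComponent -QueryTopology $qt `
        -IndexPartition $_ -SearchServiceInstance $ssi
}
Set-SPEnterpriseSearchQueryTopology -Identity $qt -Active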

FAST Search Server 2010 for SharePoint
Enhancements:
* Can search in any language
* Can detect 84 languages
* Lemmatization (handles word variations)
* Ability to sort on any metadata
* Retrieves metadata from the entire result set, not just the first 50 items

Architecture and Topology —

WFE – provides the Query and Federation Object Model, the Query Web Service, and Search Centers with the accompanying Web Parts to accept queries from and present results to users.
SharePoint App Servers – provide the FAST Content SSA and the FAST Query SSA.
FAST Application Servers —
Database Server —
*** With FAST, the metadata is stored in an optimized file system, not SQL.

This is a long, hard-to-chew chapter.
