Enhance your search with faceted navigation, result highlighting, fuzzy queries, ranked scoring, and more. Solr can also index rich document types such as PDF and MS Office formats. Solr Enterprise Search Server, written by David Smiley with coauthor Eric Pugh, was the first book on Solr.

Solr 1.4 Enterprise Search Server Pdf

Published (Last):28.12.2015

I've finally finished the book “Solr Enterprise Search Server” with my co-author Eric. We are proud to present the first book on Solr. Please run the browser showing this tutorial and the Solr server on the same machine.

Solr can run in any Java Servlet Container of your choice, but to simplify this tutorial, the example index includes a small installation of Jetty.

This will start up the Jetty application server and use your terminal to display the logging information from Solr. You can see that Solr is running by loading its admin page in your web browser. This is the main starting point for administering Solr. Your Solr server is up and running, but it doesn't contain any data.

You can modify a Solr index by POSTing XML Documents containing instructions to add or update documents, delete documents, commit pending adds and deletes, and optimize your index.
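As a minimal sketch of what such an instruction document looks like, the following Python builds an add message from plain dictionaries. The document ID and field values are made-up placeholders, and the helper name is hypothetical; only the add, doc, field, and commit element names come from Solr's XML update format.

```python
import xml.etree.ElementTree as ET


def build_add_message(docs):
    """Return an <add> update message for a list of {field: value} dicts."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            # A list value becomes repeated <field> elements (a multivalued field).
            values = value if isinstance(value, list) else [value]
            for v in values:
                field = ET.SubElement(doc_el, "field", name=name)
                field.text = str(v)
    return ET.tostring(add, encoding="unicode")


# Placeholder document; a real one must use the field names of your schema.
message = build_add_message([{"id": "EXAMPLE-1", "name": "Example document"}])
print(message)
print("<commit/>")  # a separate message that makes the pending adds visible
```

The two printed strings are the bodies you would POST to the update URL, one after the other.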

The exampledocs directory contains samples of the types of instructions Solr expects, as well as a Java utility for posting them from the command line. To try this, open a new terminal window, enter the exampledocs directory, and run the posting utility with java -jar. You have now indexed two documents in Solr, and committed these changes.

You can now search for "solr" using the "Make a Query" interface on the Admin screen, and you should get one result. Clicking the "Search" button takes you to the URL of that query. You may have noticed that posting the same file again does not add a second result. This is because the example schema specifies a uniqueKey field. Whenever you POST instructions to Solr to add a document with the same value for the uniqueKey as an existing document, it automatically replaces it for you.

You can re-post the sample XML files over and over again as much as you want and numDocs will never increase, because the new documents will constantly be replacing the old. Go ahead and edit the existing XML files to change some of the data, and re-run the java -jar posting command.
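The replace-on-same-key behaviour can be pictured with a toy in-memory model. This is only an illustration of the idea, not how Solr stores documents; the ToyIndex class and its field values are invented for the example.

```python
class ToyIndex:
    """A toy stand-in for an index that enforces a unique key field."""

    def __init__(self, unique_key="id"):
        self.unique_key = unique_key
        self._docs = {}

    def add(self, doc):
        # A document with an existing key replaces the old one.
        self._docs[doc[self.unique_key]] = doc

    @property
    def num_docs(self):
        return len(self._docs)


index = ToyIndex()
index.add({"id": "A1", "name": "first version"})
index.add({"id": "A1", "name": "edited version"})  # same id: replaces, does not add
print(index.num_docs)
```

However many times the same id is added, the document count stays at one, which is why numDocs does not grow when you re-post the sample files.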

You can delete data by POSTing a delete command to the update URL and specifying the value of the document's unique key field, or a query that matches multiple documents (be careful with that one!). Since these commands are smaller, we will specify them right on the command line rather than reference an XML file. If you search for id: SPN it will still be found, because index changes are not visible until they are committed and a new searcher is opened. To cause this to happen, send a commit command to Solr.

Now re-execute the previous search and verify that no matching documents are found. Delete-by-query works the same way; for example, it can delete anything with DDR in the name.
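The delete commands described above are small XML messages. The sketch below builds them as strings; the document ID is a placeholder, and the helper names are hypothetical. The name:DDR query mirrors the delete-by-query case mentioned in the text.

```python
from xml.sax.saxutils import escape


def delete_by_id(doc_id):
    """Delete the single document whose unique key equals doc_id."""
    return "<delete><id>{}</id></delete>".format(escape(doc_id))


def delete_by_query(query):
    """Delete every document matching the query (be careful with this one)."""
    return "<delete><query>{}</query></delete>".format(escape(query))


COMMIT = "<commit/>"  # deletes stay invisible to searches until this is sent

messages = [delete_by_id("EXAMPLE-1"), delete_by_query("name:DDR"), COMMIT]
for body in messages:
    print(body)
```

Each string is one request body for the update URL; sending the commit last follows the advice to batch changes before committing.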

Commit can be an expensive operation so it's best to make many changes to an index in a batch and then send the commit command at the end.

There is also an optimize command that does the same thing as commit, in addition to merging all index segments into a single segment, making it faster to search and causing any deleted documents to be removed. All of the update commands are covered in the Solr documentation. To continue with the tutorial, re-add any documents you may have deleted by going to the exampledocs directory and posting them again.

You can pass a number of optional request parameters to the request handler to control what information is returned.

For example, you can use the "fl" parameter to control what stored fields are returned, and if the relevancy score is returned. Solr provides a query form within the web admin interface that allows setting the various request parameters and is useful when trying out or debugging queries.

Solr provides a simple method to sort on one or more indexed fields. Use the 'sort' parameter to specify "field direction" pairs. If no sort is specified, the default is score desc to return the matches having the highest relevancy.
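To make the fl and sort parameters concrete, here is a small sketch that assembles a query URL. The base URL (host, port, and path) is an assumption for a local example install, and the field names are illustrative; only the q, fl, and sort parameter names come from the text.

```python
from urllib.parse import urlencode

# Assumed location of a local example server; adjust to your installation.
BASE_URL = "http://localhost:8983/solr/select"


def select_url(q, fields=None, sort=None, base_url=BASE_URL):
    """Build a search URL with an optional field list and sort order."""
    params = [("q", q)]
    if fields:
        # fl is a comma-separated list; "score" asks for the relevancy score.
        params.append(("fl", ",".join(fields)))
    if sort:
        # sort is a comma-separated list of "field direction" pairs.
        params.append(("sort", ",".join("{} {}".format(f, d) for f, d in sort)))
    return base_url + "?" + urlencode(params)


url = select_url("video", fields=["name", "id", "score"], sort=[("price", "desc")])
print(url)
```

Leaving out sort falls back to the default ordering by descending score.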

Hit highlighting returns relevant snippets of each returned document, and highlights keywords from the query within those context snippets. For example, a search for video card can request highlighting on the fields name and features. More request parameters for controlling highlighting are described in the Solr documentation.

Faceted search takes the documents matched by a query and generates counts for various properties or categories.

Links are usually provided that allow users to "drill down" or refine their search results based on the returned categories. Notice that although only the first 10 documents are returned in the results list, the facet counts generated are for the complete set of documents that match the query.

We can facet multiple ways at the same time, for example by adding a facet on the boolean inStock field.

Solr can also generate counts for arbitrary queries. For example, a query for ipod can show counts for prices below and above a threshold by using range queries on the price field. One can even facet by date ranges. More information on faceted search may be found on the faceting overview and faceting parameters pages.
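The highlighting and faceting requests described above come down to a few extra query parameters. The sketch below only builds the query strings; the price threshold of 100 is an arbitrary value chosen for illustration, and the helper names are hypothetical.

```python
from urllib.parse import urlencode


def highlight_params(fields):
    """Parameters that switch highlighting on for the given fields."""
    return [("hl", "true"), ("hl.fl", ",".join(fields))]


def facet_params(fields=(), queries=()):
    """Parameters for field facets and arbitrary query facets."""
    params = [("facet", "true")]
    params += [("facet.field", f) for f in fields]
    params += [("facet.query", q) for q in queries]
    return params


hl_query = urlencode([("q", "video card")] + highlight_params(["name", "features"]))

facet_query = urlencode(
    [("q", "ipod")]
    + facet_params(
        fields=["cat", "inStock"],
        # 100 is an illustrative threshold, not a value taken from the tutorial.
        queries=["price:[0 TO 100]", "price:[100 TO *]"],
    )
)

print(hl_query)
print(facet_query)
```

Repeating facet.field and facet.query is what lets one request facet several ways at once.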

Text fields are typically indexed by breaking the field into words and applying various transformations such as lowercasing, removing plurals, or stemming to increase relevancy. The same text transformations are normally applied to any queries in order to match what is indexed.

The schema defines the fields in the index and what type of analysis is applied to them.


A full description of the analysis components, Analyzers, Tokenizers, and TokenFilters available for use can be found in the Solr documentation. There is a handy analysis debugging page where you can see how a text value is broken down into words, and which shows the resulting tokens after they pass through each filter in the chain. Each row of the table shows the resulting tokens after having passed through the next TokenFilter in the Analyzer for the name field.

First, go to the bin directory, and then run the main Solr command; Windows has its own version of the script. You'll see a few lines of output as Solr is started, and then the techproducts collection is created via an API call.

Then the sample data is loaded into Solr. When it's done, you'll be directed to the Solr admin page. To stop Solr, use the same Solr command script.

A quick tour of Solr

Point your browser to Solr's administrative interface. The admin site is a single-page application that provides access to some of the more important aspects of a running Solr instance.

The administrative interface is currently being completely revamped, and the below interface may be deprecated. In the preceding screenshot, the navigation is on the left while the main content is on the right. The left navigation is present on every page of the admin site and is divided into two sections.

The primary section contains choices related to higher-level Solr and Java features, while the secondary section lists all of the running Solr cores. The default page for the admin site is Dashboard. This gives you a snapshot of some basic configuration settings and stats, for Solr, the JVM, and the server. The Dashboard page is divided into the following subareas. This area displays the Java implementation, version, and processor count.

The various Java system properties are also listed here. This area displays the overview of memory settings and usage; this is essential information for debugging and optimizing memory settings. This meter shows the allocation of JVM memory, and is key to understanding if garbage collection is happening properly. If the dark gray band occupies the entire meter, you will see all sorts of memory related exceptions!

Logging: This page is a real-time view of logging, showing the time, level, logger, and message. This section also allows you to adjust the logging levels for different parts of Solr at runtime. For Jetty, as we're running it, this output goes to the console and nowhere else. See Chapter 11, Deployment, for more information on configuring logging.

Core Admin: This section is for information and controls for managing Solr cores. Here, you can unload, reload, rename, swap, and optimize the selected core. There is also an option for adding a new core.

Java Properties: This lists Java system properties, which are basically Java-oriented global environment variables, including the command used to start the Solr Java process.

Thread Dump: This displays a Java thread dump, useful for experienced Java developers in diagnosing problems.

Below the primary navigation is a list of running Solr cores.

Click on the Core Selector drop-down menu and select the techproducts link. The default page, labeled Overview, shows core statistics, information about replication, and an Admin Extra area for each core. Some other options, such as details about Healthcheck, are grayed out and made visible only if the feature is enabled.

You probably noticed the subchoice menu that appeared below techproducts. Here is an overview of what those subchoices provide.

This is used for diagnosing query and indexing problems related to text analysis. This is an advanced screen and will be discussed later.

Data Import: Like replication, it is only useful when DIH is enabled.

Provides a simple interface for creating a document to index into Solr via the browser. This includes a Document Builder that walks you through adding individual fields of data.

Exposes all the files that make up the core's configuration, including core files such as the schema.

Clicking on this sends a ping request to Solr, displaying the latency.

The primary purpose of the ping response is to provide a health status to other services, such as a load balancer. The ping response is a formatted status document and it is designed to fail if Solr can't perform a search query that you provide. Here you will find statistics such as timing and cache hit ratios. In Chapter 10, Scaling Solr, we will visit this screen to evaluate Solr's performance.

This brings you to a search form with many options. With or without this search form, you will soon end up directly manipulating the URL using this book as a reference. There's no data in Solr yet, so there's no point in using the form right now.

This contains index replication status information, and the controls for disabling it. It is only useful when replication is enabled. More information on this is available in Chapter 10, Scaling Solr.

Schema Browser: This is an analytical view of the schema that reflects various statistics of the actual data in the index. We'll come back to this later.

Segments Info: Segments are the underlying files that make up the Lucene data structure. As you index information, they expand and compress. This allows you to monitor them, and was newly added to Solr 5. You can partially customize the admin view by editing a few templates that are provided. The template filenames are prefixed with admin-extra, and are located in the conf directory.

We saw this data loaded as part of creating the techproducts Solr core when we started Solr. We're going to use that for the remainder of this chapter so that we can explore Solr more, without getting into schema design and deeper data loading options. For the rest of the book, we'll base the examples on the digital supplement to the book; more on that later.

We're going to re-index the example data by using the post JAR file. Most JAR files aren't executable, but this one is. This simple program takes a Java system variable to specify the collection. Finally, it will send a commit command, which will cause documents that were posted prior to the commit to be saved and made visible. Obviously, Solr must be running for this to work. The command's output names each file as it is posted.

Let's take a look at one of the XML files we just posted to Solr, the monitor file. The XML schema for files that can be posted to Solr is very simple. This file doesn't demonstrate all of the elements and attributes, but it shows the essentials. This subset may very well be all that you use. More about these options and other data loading choices will be discussed in Chapter 4, Indexing Data.

A simple query

Point your browser to the core's query form. The search box is labeled q.

The request URL and Solr's search response are displayed to the right. It is convenient to use the form as a starting point for developing a search, but then subsequently refine the URL directly in the browser instead of returning to the form. The URL itself appears at the top of the main content area. Most modern browsers, such as Firefox, provide a good JSON view with syntax coloring and hierarchical controls.

All response formats have the same basic structure as the JSON you're about to see. More information on these formats can be found in Chapter 4, Indexing Data.

The JSON response consists of two main elements: a header and the results. The header lists the request parameters. You can see all of the applied parameters in the response by setting the echoParams parameter to all.

The query time reported in the header does not include streaming back the response. Due to multiple layers of caching, you will find that your searches will often complete in a millisecond or less if you've run the query before. More information on these parameters and many more is available in Chapter 5, Searching.

Next up is the most important part, the results. The numFound value is the number of documents matching the query in the entire index. The start parameter is the index offset into those matching ordered documents that are returned in the response below.

Often, you'll want to see the score of each matching document.


The document score is a number that represents how relevant the document is to the search query. This search response doesn't include scores because they need to be explicitly requested in the fl parameter, a comma-separated field list. It's independent of the sort order or result paging parameters.

The content of the result element is a list of documents that matched the query. The default sort is by descending score. Later, we'll do some sorting by specified fields. The document list is pretty straightforward. By default, Solr will list all of the stored fields. Not all of the fields are necessarily stored; that is, you can query on them but not retrieve their value, which is an optimization choice.

Notice that it uses the basic data types of strings, integers, floats, and Booleans. Also note that certain fields, such as features and cat are multivalued, as indicated by the use of [] to denote an array in JSON. This was a basic keyword search.
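To tie the pieces of the response together, the following sketch parses a small hand-written JSON response and pulls out the header, the counts, and the multivalued fields. The sample documents and their values are invented for illustration; only the overall shape (a header element followed by a response element with numFound, start, and a document list) follows the description above.

```python
import json

# Hand-written sample in the shape of a Solr JSON response; values are made up.
SAMPLE = """
{
  "responseHeader": {"status": 0, "QTime": 1, "params": {"q": "monitor", "wt": "json"}},
  "response": {
    "numFound": 2,
    "start": 0,
    "docs": [
      {"id": "DOC-1", "name": "Example monitor", "cat": ["electronics", "monitor"],
       "price": 199.5, "inStock": true},
      {"id": "DOC-2", "name": "Example panel", "cat": ["electronics"],
       "price": 89.0, "inStock": false}
    ]
  }
}
"""

data = json.loads(SAMPLE)
header = data["responseHeader"]
results = data["response"]
docs = results["docs"]

# Multivalued fields arrive as JSON arrays, which json.loads turns into lists.
multivalued = sorted(
    {name for doc in docs for name, value in doc.items() if isinstance(value, list)}
)

print("query time (ms):", header["QTime"])
print("matches:", results["numFound"], "starting at offset", results["start"])
print("multivalued fields:", multivalued)
```

Note that numFound counts all matches in the index, while the length of the document list is limited by paging.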

As you start using more search features such as faceting and highlighting, you will see additional information following the response element. This page provides details on all the components of Solr. Before we loaded data into Solr, this page reported that numDocs was 0, but now it should reflect the documents we loaded. Solr isn't exactly REST-based, but it is very similar.

These statistics are accumulated since when Solr was started or reloaded, and they are not stored to disk. As such, you cannot use them for long-term statistics.

There are third-party SaaS solutions referenced in Chapter 11, Deployment, which capture more statistics and persist them long-term.

The sample browse interface

The final destination of our quick Solr tour is to visit the so-called browse interface. It's for demonstrating various Solr features.

Query debugging: Here, you can toggle display of the parsed query and document score "explain" information.

Here, you can start typing a word like enco and suddenly "encoded" will be suggested to you.

Here, the highlighting of query words in search results is in bold, which might not be obvious.

This includes field value facets, query facets, numeric range facets, and date range facets. This shows how the search results cluster together based on certain words.

You must first start Solr as the instructions describe in the lower left-hand corner of the screen. Geospatial search: Here, you can filter by distance. Click on the spatial link at the top-left to enable this. This is also a demonstration of Solritas, which formats Solr requests using templates that are based on Apache Velocity. Solritas is primarily for search UI prototyping. It is not recommended for building anything substantial. See Chapter 9, Integrating Solr, for more information. The browse UI as supplied assumes the default example Solr schema.

It will not work out of the box against another schema without modification. These configuration files are extremely well documented. A Solr core's instance directory is laid out as follows. One directory contains the configuration files, solrconfig among them. The previously discussed browse UI is implemented with a set of templates. Another directory is a good place to put contrib JAR files and their dependencies. You'll need to create this directory on your own, though; it doesn't exist by default. Unlike typical database software, in which the configuration files don't need to be modified much if at all from their defaults, you will modify Solr's configuration files extensively, especially the schema.

The as-provided state of these files is really just an example to both demonstrate features and document their configuration and should not be taken as the only way of configuring Solr.

It should also be noted that in order for Solr to recognize configuration changes, a core must be reloaded or Solr must be restarted. Solr's schema for the index is defined in the schema file. You will observe that the names of the fields in the documents we added to Solr intuitively correspond to the sample schema.

There are various reasons for doing this, but they boil down to needing to index data in different ways for specific search purposes. You'll learn all that you could want to know about the schema in the next chapter.

Each Solr core's solrconfig file defines, among other things, the request handlers, which make up about half of the file. In our first query, we didn't specify any request handler, so we got the default one. Each HTTP request to Solr, including posting documents and searches, goes through a particular request handler.


The well-documented file also explains how and when parameters can instead be added to the lst blocks named appends or invariants. This arrangement allows you to set up a request handler for a particular application that will be searching Solr, without forcing the application to specify all of its search parameters.
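One way to picture how a request handler's configured parameters combine with those of an incoming request is the sketch below. It is an illustrative model rather than Solr's code, and it assumes the usual reading of the three blocks: defaults apply only when the request is silent, appends are added to whatever the request sent, and invariants always win. The parameter values are made up.

```python
def effective_params(request, defaults=None, appends=None, invariants=None):
    """Combine request parameters with a handler's configured parameter blocks.

    Every argument maps a parameter name to a list of string values.
    """
    # Start from the defaults, then let the request replace them.
    params = {name: list(values) for name, values in (defaults or {}).items()}
    for name, values in request.items():
        params[name] = list(values)
    # Appends are added on top of whatever is already there.
    for name, values in (appends or {}).items():
        params.setdefault(name, []).extend(values)
    # Invariants override everything, including the request.
    for name, values in (invariants or {}).items():
        params[name] = list(values)
    return params


merged = effective_params(
    request={"q": ["ipod"], "rows": ["5"], "wt": ["xml"], "fq": ["cat:electronics"]},
    defaults={"rows": ["10"], "fl": ["*"]},
    appends={"fq": ["inStock:true"]},
    invariants={"wt": ["json"]},
)
print(merged)
```

In this example the application only has to send q; the handler configuration supplies or enforces the rest.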

More information on request handlers can be found in Chapter 5, Searching.

What's next?

You now have an excellent, broad overview of Solr! The numerous features of this tool will no doubt bring the process of implementing a world-class search engine closer to reality. But creating a real, production-ready search solution is a big task.

So, where do you begin? As you're getting to know Solr, it might help to think about the main process in three phases: indexing, searching, and integration.

Schema design and indexing

In what ways do you need your data to be searched? Will you need faceted navigation, spelling suggestions, or more-like-this capabilities?

Knowing your requirements up front is the key in producing a well-designed search solution. Understanding how to implement these features is critical. A well-designed schema lays the foundation for a successful Solr implementation. However, during the development cycle, having the flexibility to try different field types without changing the schema and restarting Solr can be very handy.

The dynamic fields feature allows you to assign field types by using field name conventions during indexing.
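The naming-convention idea can be sketched as a simple pattern lookup. The suffix-to-type table below is illustrative, the type names are placeholders, and the first-match rule is a simplification; check your own schema for the patterns it actually declares.

```python
from fnmatch import fnmatchcase

# Illustrative patterns; the real list lives in your schema's dynamicField tags.
DYNAMIC_FIELDS = [
    ("*_i", "int"),
    ("*_s", "string"),
    ("*_t", "text"),
    ("*_b", "boolean"),
    ("*_dt", "date"),
]


def field_type_for(name, explicit_fields, dynamic_fields=DYNAMIC_FIELDS):
    """Return the type for a field name, or None if nothing matches."""
    # Explicitly declared fields take precedence over naming conventions.
    if name in explicit_fields:
        return explicit_fields[name]
    # Simplification: return the first pattern that matches.
    for pattern, field_type in dynamic_fields:
        if fnmatchcase(name, pattern):
            return field_type
    return None


explicit = {"id": "string"}
for field in ["id", "weight_i", "summary_t", "created_dt", "mystery"]:
    print(field, "->", field_type_for(field, explicit))
```

With a convention like this, a new field such as weight_i can be indexed without touching the schema first.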

Solr provides many useful predefined dynamic fields. Chapter 2, Schema Design, will cover this in-depth. However, you can also get started right now. The dynamicField XML tags in the schema show what is available. Copying an example file, adding your own data, changing the suffixes, and indexing via the SimplePost tool is all as simple as it sounds. Give it a try!

Text analysis

It's probably a good time to talk a little more about text analysis.

When considering field types, it's important to understand how your data is processed. For string types, you'll also need to think about how the text is analyzed. Simply put, text analysis is the process of extracting useful information from a text field. This process normally includes two steps: tokenization and filtering. Analyzers encapsulate this entire process, and Solr provides a way to mix and match analyzer behaviors by configuration. Tokenizers split up text into smaller chunks called tokens. There are many different kinds of tokenizers in Solr, the most common of which splits text on word boundaries, or whitespace.

Others split on regular expressions, or even word prefixes. The tokenizer produces a stream of tokens, which can be fed to an optional series of filters. Filters, as you may have guessed, commonly remove noise: things such as punctuation and duplicate words.

Once the tokens pass through the analyzer processor chain, they are added to the Lucene index. Chapter 2, Schema Design, covers this process in detail.
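The tokenizer-then-filters chain can be imitated in a few lines of Python. This is a toy model of the idea only; the function names are invented and the filters are far simpler than the analysis components Solr ships with.

```python
import re


def whitespace_tokenizer(text):
    """Split text into tokens on whitespace."""
    return text.split()


def lowercase_filter(tokens):
    return [token.lower() for token in tokens]


def punctuation_filter(tokens):
    """Strip leading and trailing punctuation, dropping tokens left empty."""
    stripped = (re.sub(r"^\W+|\W+$", "", token) for token in tokens)
    return [token for token in stripped if token]


def duplicate_filter(tokens):
    """Keep only the first occurrence of each token (a toy simplification)."""
    seen = set()
    unique = []
    for token in tokens:
        if token not in seen:
            seen.add(token)
            unique.append(token)
    return unique


def analyze(text, tokenizer, filters):
    """Run the tokenizer, then feed its tokens through each filter in order."""
    tokens = tokenizer(text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


tokens = analyze(
    "Hard drive, hard DRIVE!",
    whitespace_tokenizer,
    [lowercase_filter, punctuation_filter, duplicate_filter],
)
print(tokens)
```

Applying the same chain to the query text is what lets a search for "Hard Drive" match the tokens stored for "hard drive,".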

Searching

The next step is, naturally, searching. The [e]dismax query parser is not the default but, in our opinion, arguably should be; it handles end-user queries very well. There are a few more configuration parameters it needs, described in Chapter 5, Searching. Here are a few example queries to get you thinking. Be sure to start up Solr and index the sample data by following the instructions in the previous section.

One query finds all the documents that have the phrase hard drive in their cat field. Another finds all the documents that are in stock and have a popularity greater than 6. A query using the eDisMax query parser returns documents where the user query in q matches the name, manu, and cat fields. Faceting and statistics can be requested in the same way. For detailed information on searching, see Chapter 5, Searching.
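The example queries described above might be written as the following request parameters. This is a sketch under a few assumptions: popularity is taken to be an integer field, so "greater than 6" is expressed as 7 or more, and the facet and statistics fields are chosen only for illustration.

```python
from urllib.parse import urlencode

EXAMPLES = {
    # Phrase query against a single field.
    "phrase": [("q", 'cat:"hard drive"')],
    # Boolean query with an open-ended range (assumes an integer field).
    "in_stock_popular": [("q", "inStock:true AND popularity:[7 TO *]")],
    # eDisMax: the user's words in q are searched across the qf fields.
    "edismax": [("q", "ipod"), ("defType", "edismax"), ("qf", "name manu cat")],
    # Facet counts and numeric statistics alongside the results.
    "facets_and_stats": [
        ("q", "ipod"),
        ("facet", "true"),
        ("facet.field", "cat"),
        ("stats", "true"),
        ("stats.field", "price"),
    ],
}

query_strings = {name: urlencode(params) for name, params in EXAMPLES.items()}
for name, query_string in query_strings.items():
    print(name, "->", query_string)
```

Each query string is appended after the question mark of a core's search URL.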

Integration

If the previous tips on indexing and searching are enough to get you started, then you must be wondering how you integrate Solr and your application. You can make use of one of the many HTTP client libraries available. The Ruby library RSolr is one example: a small script using it with one of the previous sample queries would print out each document matching the query ipod.
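For readers without a client library at hand, the same idea can be sketched with nothing but the Python standard library. The core URL below is an assumption for a local techproducts install, the function names are hypothetical, and the canned result stands in for a live response so that the snippet runs without a server.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed location of a local core; change it to match your installation.
CORE_URL = "http://localhost:8983/solr/techproducts"


def build_select_url(q, rows=10, core_url=CORE_URL):
    """Build the search URL, asking for a JSON response."""
    return core_url + "/select?" + urlencode({"q": q, "rows": rows, "wt": "json"})


def search(q, rows=10, core_url=CORE_URL):
    """Send the query to a running Solr and return the decoded JSON response."""
    with urlopen(build_select_url(q, rows, core_url)) as response:
        return json.load(response)


def describe(result):
    """Return one printable line per matching document."""
    return [
        "{}: {}".format(doc.get("id"), doc.get("name"))
        for doc in result["response"]["docs"]
    ]


# Canned response used here instead of calling search("ipod") against a server.
canned = {
    "response": {
        "numFound": 1,
        "start": 0,
        "docs": [{"id": "DOC-1", "name": "Example iPod"}],
    }
}
for line in describe(canned):
    print(line)
```

With Solr running, replacing the canned dictionary with the return value of search("ipod") would print the real matches.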

There are many client implementations, and finding the right one for you is dependent on the programming language your application is written in. Chapter 9, Integrating Solr, covers this in depth, and will surely set you in the right direction.

Resources outside this book

The following are some Solr resources other than this book.

One resource is a book in a style that comprises a series of posed questions or problems, each followed by its solution.

Apache Solr Reference Guide is a detailed, online resource contributed by Lucidworks to the Solr community. Consider downloading the PDF corresponding to the Solr release you are using.

Solr's Wiki is fairly organized for a Wiki. In particular, if you use a particular app-server in production, then there is probably a Wiki page there on specific details. Read them!

The solr-user mailing list archives are worth searching. If you have a few discriminating keywords, then you can find nuggets of information in there with a search engine. We highly recommend that you subscribe to the solr-user mailing list. You'll learn a lot and potentially help others, too.

Solr's issue tracker contains information on enhancements and bugs. Some of the comments attached to these issues can be extensive and enlightening. You'll see references to these issues in this book and elsewhere.

If you intend to dive into Solr's internals, then you will find Lucene resources helpful, but that is not the focus of this book.

Summary

This completes a quick introduction to Solr. In the following chapters, you're really going to get familiar with what Solr has to offer. We recommend that you proceed in order from the next chapter through Chapter 8, Search Components, because these build on each other and expose nearly all of the capabilities in Solr. These chapters are also useful as a reference to Solr's features.

You can, of course, skip over sections that are not interesting to you. Chapter 9, Integrating Solr, is one you might peruse at any time, as it may have a section applicable to your Solr usage scenario.

Finally, be sure that you don't miss the appendix for a search quick-reference cheat-sheet.

Third Edition

Solr is a widely popular open source enterprise search server that delivers powerful search and faceted navigation features that are elusive with databases.

Kranti Parisa has more than a decade of software development expertise and a deep understanding of open source, enterprise software, and the execution required to build successful products.

In this chapter, we're going to cover the following topics:

An overview of what Solr and Lucene are all about
What makes Solr different from databases?
How to get Solr, what's included, and what is where?
Running Solr and importing sample data
A quick tour of the admin interface and key configuration files
A brief guide on how to get started quickly

An introduction to Solr

Solr is an open source enterprise search server. Some of Solr's most notable features beyond Lucene are its search-enhancing features. There are many, but here are some notable ones:

A highlighter feature to show matching query terms found in context.
A query spellchecker based on indexed content or a supplied dictionary.
Multiple suggesters for completing query strings.

The most important comparison to make is with respect to the data model, that is, the organizational structure of the data.
