Abbreviations (see also SMS Text Message Abbreviations): a shortened form of a word or phrase, often a letter or group of letters taken from the original word or phrase. For example: Advert (Advertisement), Phone (Telephone), UK (United Kingdom), Dr. (Doctor). Abbreviations can cause problems for people using just key word search techniques, as potentially pertinent items may be missed when they contain only the form the searcher did not enter. Including all implied abbreviations and full-word versions of a term in a key word search engine can be laborious and time-consuming. The main benefit is that if the user is looking for just the abbreviated version and not the full-word equivalent (or vice versa), a key word search should enable this.
Acronyms: are a type of abbreviation usually formed by the first letters of each word in a phrase or name, but may include additional letters in some instances, for example SWOT (Strengths, Weaknesses, Opportunities, Threats), WYSIWYG (What you see is what you get), RADAR (Radio Detection and Ranging). The term 'acronym' is often used to describe 'initialisms' – where the letters do not form a 'word' – for example USB (Universal Serial Bus), HMRC (Her Majesty's Revenue and Customs), RNLI (Royal National Lifeboat Institution), IoD (Institute of Directors). Acronyms can be challenging for basic search engines, especially key word search engines, for the same reason as abbreviations (see Abbreviations).
Annotators: (in computing terms) are components (code) which contain the logic that analyzes content (documents, images, video), assesses/extracts descriptive data about the content, and adds additional information which assists in describing the content. The extracted/added information is often called 'metadata', which can then be searched by search applications.
Actionable Search: Allows users to do something (take action) directly from a search result. Requires the search results to be accurate, reliable, and pertinent to the user need, including the context within which the search was made. Most applicable to ‘research’ type searches where the user is looking for information on a topic or subject without knowing exactly what they are looking for; as opposed to ‘navigational’ search where the user is looking for a specific item which they know exists.
Algorithm: is a procedure or formula for controlling a process flow or solving a problem. In computing it is usually code (software or a software component) which runs a number of logical processes or formula-based calculations to determine required results (often from variable inputs) based on a structured and defined process or 'logic'. Often used in search technologies to locate items and return a 'hit list' against a given search requirement, along with a quality or ranking score relating to how well the located item matches the search criteria.
Attributes: Words or phrases which modify or describe nouns. For example, a document can have many attributes, which could include size, format, content (pictures, text) and 'type' (memorandum, report, advert, contract). Can be used for refining searches or for providing additional information regarding stored content items or returned search results.
Boolean Search: A type of search which includes additional parameters/operators relating to the search terms which enable a narrowing of the search compared to key word searching. Boolean logic consists of three primary logical operators: AND (all terms must be present, narrowing the search), OR (any of the terms may be present, broadening the search) and NOT (excluding results containing a given term).
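The three Boolean operators map naturally onto set operations over an inverted index. A minimal sketch (the documents and terms below are made up for illustration, not a real engine):

```python
# Minimal sketch of Boolean search: AND/OR/NOT map onto set
# intersection, union and difference over an inverted index.
docs = {
    1: "red cars and red vans",
    2: "blue cars",
    3: "red bicycles",
}

# Build the inverted index: term -> set of document ids containing it.
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

red, cars = index["red"], index["cars"]
print(sorted(red & cars))  # red AND cars -> [1]
print(sorted(red | cars))  # red OR cars  -> [1, 2, 3]
print(sorted(red - cars))  # red NOT cars -> [3]
```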
Boost Words: A dictionary of terms (and/or phrases) that impact the ranking level of content items in which the term(s) appear. Enables adjustment of the ranking of search results based on organisations’ policies, historical search patterns or other criteria, to support enhanced efficiency and effective search results.
Categorization: is the process of grouping objects or information, usually based on their relationships or attributes. Categorisation enables groupings of items in a manner which assists in locating items, differentiating items and gaining understanding. A single item may fit within several categories. For example the category of ‘pigeon’ would contain all pigeons. The category of ‘birds’ would contain all birds (including pigeons), and the category of ‘living things’ would contain the category of ‘birds’ (which contains the category of ‘pigeons’) along with all other plants, animals and other ‘living things’. Many forms of categorization are possible for search purposes including ‘attributes’, for example ‘presentations’, ‘bid documents’, ‘contracts’, ‘.pdf documents’, ‘photos’.
Categorized Search Results (see also Categorization): are search results which are grouped into a meaningful classification based on categories.
Cataloguing: See Classification.
Character Normalization: identifies character combinations for which common or interchangeable alternatives are available, and therefore enables a search for one character set to locate accepted alternatives. There are a number of aspects to character normalization, for example: interchange of upper- and lower-case characters (in either direction), removal of accents and other diacritics, and umlaut expansion to 'oe' (in German).
Character normalisation allows search results to incorporate the widest potential set of applicable results without requiring the user to enter all the alternative forms and variants which may have been used by the creator/originator of the information.
Attention is required for multi-lingual corpora to enable normalisation of different languages allowing for such requirements as Asian full-width and half-width characters, Katakana middle dots (which are used as compound word delimiters in Japanese), diacritic removal for Hebrew and ligature expansion for Arabic.
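A minimal sketch of two common normalization steps (lower-casing and diacritic removal) using Python's standard `unicodedata` module; language-specific rules such as German umlaut expansion or the Asian and Hebrew requirements above are beyond this simple version:

```python
import unicodedata

def normalize(text: str) -> str:
    """Lower-case the text and strip combining accents/diacritics."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(normalize("Café"))    # -> cafe
print(normalize("Müller"))  # -> muller (note: not the German expansion 'mueller')
```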
Classification (see also Categorization): is often used to mean the same as categorization. However, many people perceive cataloguing and classification as a more defined and structured form of categorization, often relating to a relatively simple taxonomy structure. These terms originated in relation to physical libraries, where only one manner of filing was possible: a book had only one location, based on a small number of key attributes, e.g. author and 'type' (reference, fiction, science fiction), or 'type', 'topic' and 'subject' (e.g. Reference section, travel books, arranged alphabetically by country).
Classification Flux: the change in the classification of items often over comparatively short periods of time. With rapidly changing global business structures and methods classifications can no longer afford to be static and rigid. Classifications need to be able to change or flux.
Contraction Splitting: improves the quality of search results by identifying contractions and splitting them into their component parts. A contraction is similar to an abbreviation, except that an abbreviation does not have its own pronunciation and is pronounced as the full word that it represents. A contraction does have its own distinctive pronunciation, e.g. the contraction 'shouldn't' is pronounced differently from 'should not', and the contraction 'isn't' is pronounced differently from 'is not'. Splitting contractions (for example: 'shouldn't' is split into 'should' + 'not') enables greater granularity of results and greater precision in delivering pertinent results.
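A contraction splitter can be sketched as a simple lookup over tokens (the table below is a tiny illustrative subset; production analysers use far fuller dictionaries and handle ambiguous cases such as "'s"):

```python
# Hypothetical minimal contraction-splitting step for a token stream.
CONTRACTIONS = {
    "shouldn't": ["should", "not"],
    "isn't": ["is", "not"],
    "can't": ["can", "not"],
}

def split_contractions(tokens):
    """Replace each known contraction with its component words."""
    out = []
    for tok in tokens:
        out.extend(CONTRACTIONS.get(tok.lower(), [tok]))
    return out

print(split_contractions(["It", "isn't", "ready"]))
# -> ['It', 'is', 'not', 'ready']
```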
Controlled Vocabularies: mandate the use of predefined terms. Limits the number and type of terms that can be used and therefore provides a greater level of control. Greater control is enabled through restriction of the vocabulary that can be used. (See Natural Language Processing, where there is no restriction to the vocabulary used, for a comparison.)
Corpus: is a ‘body’ or collection of writings or documents. Previously often used to describe a finite set of writings or documents, such as those on a given topic or by a particular author; in the current corporate world there is an increasing production of ‘open-ended corpora’ where the ‘body’ of information is constantly being added to and modified which adds complexity to their ‘categorisation’ and increases the likelihood of ‘classification flux’. It is this scenario which often leads organisations to consider enhancing their search capabilities to enable more intuitive location of pertinent information across the corpus.
Crawler: Also known as a 'bot' or 'spider'. A crawler is a software program that search engines use to seek out information from large sources of information such as a corporate intranet or the web. In most cases a crawler is given a 'seed' set of URLs to search, and from these it locates the URLs of linked pages, and so on.
Custom Dictionary (see also Dictionary): is a dictionary of terms which varies from the standard dictionary. This may be a limited set of terms/words, or terms in addition to a 'standard' dictionary, and may, for example, include industry-specific terminology, acronyms and abbreviations, or 'commonly misspelled words'. Custom dictionaries can aid in enterprise search situations by enabling greater granularity and performance (likelihood of pertinent results) for obscure or organisation-specific terms.
Dictionary (see also Custom Dictionary): was originally a book containing words and terms listed in alphabetical order along with their meanings and other information about them. In the IT and search context, dictionaries can refer to an organised list of words or terms for reference purposes, often by software or algorithms.
Disambiguation: is the ability to remove ambiguity from the meaning of a word or term. Ambiguity means that there is more than one meaning for a given word or term, so search results could deliver items matching the word or term but with a meaning the user did not intend. Alternatively, the search could provide a long list of results in which the results for the different meanings are mixed together, and any relevancy/ranking attempts to rank disassociated items, many of which are irrelevant to the searcher's intended objective.
In order to provide pertinent and relevant search results, some form of disambiguation must be performed in order to understand the meaning of the term being searched for.
Disambiguation usually requires semantic analysis to understand the intent of the user. However, in some cases understanding the user’s role in an organisation can assist. Understanding the context of the search enables a certain level of automatic disambiguation – for example in a financial organisation a search for ‘bond’ is likely to refer to a financial instrument and not a film character, actor, helicopters or joining two items together with adhesive. However, it could still reasonably and usefully refer to a company name, a person’s name (e.g. a client) or a financial bond. Thus further disambiguation would be required – potentially using semantic analysis of the other terms used in the search enquiry or within the wider network or community of which the user is a member.
Other simple examples where disambiguation would be required often include names (of places, people or products): e.g. understanding whether 'Washington' was intended as a place or a person's name, whether 'June' is a month or a name, whether 'Rob' is an activity or a name, whether 'Jaguar' is a large cat or a car, whether 'Polo' is a mint, a small European car, a sport, a brand of clothing, a type of shirt etc. The same word can also carry different meanings in different regions (or countries): 'Fall' in America can mean the same as 'Autumn' in England; 'trunk' in America can mean the rear luggage compartment of a car ('boot' in England), whereas in England a trunk is usually part of a tree, part of an elephant, a major road, a piece of luggage, or a garment worn for swimming ('a pair of trunks'); 'pants' in America means the same as 'trousers' in England (and in England 'pants' are underwear). In English alone some words can have literally hundreds of potential meanings and uses, such as 'run', 'go', 'take', 'stand', 'strike' and 'set'.
End-of-Sentence Marker Recognition: is the linguistic process of correctly identifying end-of-sentence markers for sentence segmentation, for example differentiating between a full stop (US ‘period’) at the end of a sentence and the same punctuation used as a separator, or abbreviation, or bullet point, part way through a sentence.
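The idea can be sketched with a naive splitter that treats a full stop as an end-of-sentence marker only when the token carrying it is not a known abbreviation (the abbreviation list is an illustrative subset; real segmenters use much richer rules or trained models):

```python
ABBREVIATIONS = {"dr.", "mr.", "e.g.", "i.e.", "etc."}

def split_sentences(text):
    """Naive sentence segmentation on '.', '!' and '?', skipping known
    abbreviations so that e.g. 'Dr.' does not end a sentence."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith((".", "!", "?")) and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("Dr. Smith arrived late. He apologised."))
# -> ['Dr. Smith arrived late.', 'He apologised.']
```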
Enterprise Catalogue: a centralized and normalized metadata model for content across the enterprise, especially when that content is stored or managed in disparate systems.
Entity: Something that has a distinct, separate existence. In terms of search, an entity can be an individual item often relating to a category, for example: a person (named or otherwise), a telephone number or a place.
Faceted Search/Navigation: Enables multiple criteria and combinations of terms to be used in refining searches, often with the use of Boolean operators and parentheses, but can also include aspects such as content type, geographical/geospatial and other factors.
False Positives & False Negatives: False positives are results presented which do not in fact match the search criteria. False negatives are results which are omitted when the results are displayed but which did match the search criteria and should have been displayed. (The terms are used in many ways beyond information-search scenarios.) In search scenarios involving database information and structured data, the likelihood of false positives and negatives is lower than in more advanced searches across unstructured information. These spurious 'false' results become more likely in analytical and research scenarios involving business intelligence and data analytics, or in entity analytics and identity resolution scenarios. As a result, sophisticated information and linguistic analysis techniques have developed to reduce the number of 'false' results and increase the quality of highly advanced search techniques. One of the most common causes of false positives and false negatives in structured data and 'simple' information search scenarios is poor-quality data/information which contains errors, omissions or significant spelling mistakes. See techniques for minimising 'false' results such as Character Normalization, Disambiguation, Federated Search, Lemmatisation, Morphological Variants, N-Grams, Synonyms …
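Using the standard information-retrieval framing (where 'relevant' stands in for 'matching the search criteria'), the counts can be sketched with sets of document ids (the ids below are made up for illustration):

```python
# For one query: what the engine returned vs. what actually matched.
returned = {1, 2, 3, 5}   # results the engine presented
relevant = {2, 3, 4}      # results that truly match the criteria

false_positives = returned - relevant   # presented but wrong -> {1, 5}
false_negatives = relevant - returned   # missed but right    -> {4}
true_positives = returned & relevant    # presented and right -> {2, 3}

precision = len(true_positives) / len(returned)   # 2/4 = 0.5
recall = len(true_positives) / len(relevant)      # 2/3 ~ 0.67
print(false_positives, false_negatives, precision)
```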
Federated Search: the ability to search across more than one store of information simultaneously, e.g. searching a document management system, an intranet and a database to gain a consolidated results list containing the results from each/all of the repositories. In terms of the internet, federated search can mean only gaining a federated results list from the 'top results' list of each search engine to which the query is passed – this does not give a 'true' results list, as it does not provide all the information from the resource, or it mixes 'top results' ranked information with individual data hits. As such, federated search tends to have greater value and more realistic ranking when used inside an enterprise rather than as a web search tool.
Fielded Search: reduces or limits the range of a query in order to increase the relevance of the search results by enabling the user to choose to limit a search to particular fields or attributes. The software then performs a search only for the attributes required within the fields/locations requested (rather than searching all fields or all information).
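A fielded search can be sketched over toy records, with the query limited to chosen fields (the records and field names are illustrative):

```python
records = [
    {"title": "Annual report", "author": "Jones", "body": "Revenue grew."},
    {"title": "Memo", "author": "Smith", "body": "Report due Friday."},
]

def fielded_search(term, fields):
    """Return records whose chosen fields contain the term."""
    term = term.lower()
    return [r for r in records if any(term in r[f].lower() for f in fields)]

# Limiting the search to the 'title' field excludes the memo whose
# body (but not title) mentions 'report'.
print([r["author"] for r in fielded_search("report", ["title"])])          # -> ['Jones']
print([r["author"] for r in fielded_search("report", ["title", "body"])])  # -> ['Jones', 'Smith']
```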
“Folksonomies”: A form of taxonomy usually developed by user communities who are able to add meta tags to documents/content. This enables a number of interconnecting and overlapping taxonomies to build up; however, due to the 'open' nature of folksonomies they have very few parameters or restrictions, which means there is little or no standardisation for classifying, indexing or managing content. Folksonomies tend to change and update regularly compared with master data or structured enterprise taxonomies. Folksonomies can be added to content managed under a structured taxonomy with tools such as social bookmarking (e.g. digg, delicious, dogear).
Free Text Searching (see also Full Text Searching): the term ‘free text searching’ has been used in a number of ways. It can mean the searching of text fields within a database (often specific text fields where there are no constraints or parameters limiting the type of text that is entered, in an otherwise well controlled database with parameters relating to each field), or occasionally it is used to mean ‘full text searching’. In some instances the term is used to refer to the ability to type the search criteria in natural language format without adapting it.
Full Text Searching: Searching all words contained within a document, documents, corpus or corpora as opposed to just meta-data, indices or other summary data.
Fuzzy Matching (see also Lemmatization, Stemming, Synonyms, Character Normalisation, Natural Language Processing): Attempts to return a wider range of relevant results for the terms searched for by using techniques such as stemming, lemmatization and natural language processing. By doing so the search can provide results which are not an exact match for the term(s) searched for, but which it is believed may also be relevant, for example by looking for verb conjugations, plurals and stems.
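One simple approximation of fuzzy matching uses Python's standard `difflib` to find near matches for a misspelled term against an index vocabulary (the vocabulary is illustrative; real engines combine this kind of similarity scoring with stemming, lemmatization and language knowledge):

```python
from difflib import get_close_matches

vocabulary = ["lemmatization", "normalization", "categorization", "search"]

# A misspelled query term still finds its closest vocabulary entry.
print(get_close_matches("lematization", vocabulary, n=1, cutoff=0.8))
# -> ['lemmatization']
```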
Index: A list of names or topics and the location (such as a page number) where the topic can be located.
Key Word Search: A search of an electronic store of information (often an online library, document management system or database) in which words are used. Search is made for the specific words entered by the user in the search box. Key word search can often be configured to search just citations and key attributes (such as title(s), author(s), subject(s)) OR to search for the ‘key word(s)’ within the complete document or database record, i.e. the ‘contents’ of the document/record. Key word search can be applied to individual documents, to entire repositories/libraries of information, and across multiple libraries/repositories at the same time (see Federated Search).
Lemmatization: (compare Stemming) Lemmatization is the process of grouping together the different inflected and variant forms of a word. In the IT environment, with electronic stores of text and information, lemmatisation allows the inflected and variant forms of a word to be grouped together and analysed and/or searched for as if they are a single item/word.
From a computing perspective lemmatisation is the algorithmic process of determining the lemma for each word. This task involves understanding the language/dialect in question (including the grammatical structure), the context of the word and determining which part of speech is being used. This makes lemmatization a complex task especially when applied across text written in multiple languages.
Many languages have variant and inflected forms of the same word. For example, most English verbs have inflected forms, e.g. the verb 'to drive' may be written as 'drive', 'driven', 'drives', 'driving'. The lemma or base form of the word, which links all of these words together, is 'drive'. In a similar way, plurals of nouns are reduced to their singular form.
In search terms, lemmatizing text ensures that all forms of a particular word within the searched text/repository can be located by searching only for its lemma form, OR the technology can associate the lemma form with searches for any of the word's inflected or variant forms. Lemmatization therefore increases the potential for a search engine to locate more relevant items based on the variances which the user may not search for/find with key word search. In addition, lemmatization overcomes the problem of similar but non-relevant words being retrieved, which often occurs when using techniques such as 'stemming'.
Lemmatization differs from ‘stemming’ primarily due to the understanding of language structure, grammar and the context of the word within the sentence. Results from lemmatization therefore tend to be more pertinent and more accurate than when using stemming.
| Surface Form | Simple Stemming | Lemmatization |
| organizing   | organ           | organize      |
In general, lemmatization cannot be reliably substituted by stemming, which mechanically strips endings to create a stem that approximates a lemma. These stems are frequently non-words which have no linguistic or semantic meaning, or worse, provide completely different meaning and thereby negatively impact search quality. See the table above for an example of how lemmatization provides superior results by understanding the close relationship between “organizing” and “organize,” and not including “organ,” a more distant term.
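The contrast can be sketched by comparing a crude suffix-stripping stemmer with a dictionary-based lemma lookup (both deliberately simplistic; real stemmers such as Porter's and real lemmatizers are far more sophisticated):

```python
# Crude stemming: mechanically strip the first matching suffix.
SUFFIXES = ["izing", "ized", "ing", "ed", "s"]
# Lemmatization sketch: a lookup informed by language knowledge.
LEMMAS = {"organizing": "organize", "driven": "drive", "drives": "drive"}

def crude_stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(crude_stem("organizing"))   # -> organ (a non-word, semantically distant)
print(LEMMAS.get("organizing"))   # -> organize
```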
Linguistics: is the study of language. Linguistics encompasses a number of sub-fields (phonetics, phonology, morphology, syntax, semantics, and psycholinguistics). Linguistics aims to discover the rules and representations underlying the structure of language(s) and what they reveal.
Metadata (sometimes called Metainformation): is data about data or other information; commonly descriptive data about the structure or format of the information e.g. name, size, data type, length, fields, location, ownership, author, date of creation, file type. Aids in the storage and retrieval of data, files, documents, content and other information. Is often used in search criteria to limit search results to pertinent information only, e.g. only file types of ‘.doc’ or ‘HTML’.
Meta Tag: is information placed (often by the author) into the header of a web page or electronic record which provides information (usually regarding the content of the page) which is available to search engines to aid in the location of pertinent information against search requests. Common meta tags include titles, keywords and content descriptions.
Meta-Search Engine: A search tool that automatically queries several search engines at once, retrieving results from each and consolidating them into one results list. These meta-search engines tend either to search only the 'top results' list of each search engine to which the query is passed – which does not give a 'true' results list, as it does not provide all the information from the resource – or to mix 'top results' ranked information with individual data hits. As such, meta-search engines need to be used with caution, and with a full appreciation of how the results are obtained and the implications for the person doing the search.
Morphological Variants: (in relation to linguistics) Morphology is a branch of linguistics concerned with the identification, analysis and description of the structure of words and the combination/pattern of words. Morphological variants are aspects of the structure of a word or words (such as roots, stems, and lemmas) which can assist search applications in locating pertinent data against a user search request.
Multi-Lingual Support: (referring to a search application capability) enables an application to either run in multiple languages (i.e. the user interface can be displayed in multiple languages) and/or to run against information stored in multiple languages. For requirements where information is stored in multiple languages and/or where the group of users includes people with different national languages, multi-lingual support enables a single application to provide an intuitive user experience in the preferred language, and can enable location of material in other languages which may be pertinent to the search (and which the user can subsequently have translated if required). Levels of multi-lingual support vary due to the number of languages and dialects around the world. Certain languages are more complex to analyse and to write the algorithms to interpret them and therefore more advanced search functions and techniques such as normalization, lemmatization and morphological analysis are more advanced and effective with certain languages than others. Advanced linguistic analysis is available for Arabic, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Italian, Japanese, Korean, Norwegian, Portuguese, Russian, Spanish, Swedish. Basic linguistic support is available for over 50 languages.
N-Grams: consecutive sequences of 'n' (a number) of characters. Words are assessed as strings of (normally consecutive) characters, e.g. with n = 3, 'SEARCH' is composed of the following n-grams (where '_' represents a blank or space): '_SE', 'SEA', 'EAR', 'ARC', 'RCH', 'CH_'.
Some technologies append additional spaces to the beginning and/or end of n-grams to aid with end of word and beginning of word assessment. N-grams are useful for identifying the language a piece of text is written in, for aiding in the automated categorisation of text and for support of semantic searching, which can enable spelling mistakes to have minimal impact on search results. N-grams can assist particularly in situations where text does not have much structure or quality checking during creation (e.g. blogs, forums, free text fields in databases, full text OCR results from scanned documents).
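Generating character n-grams with boundary padding can be sketched as:

```python
def char_ngrams(word, n=3, pad="_"):
    """Character n-grams of a word, padded so that '_' marks a word
    boundary (as in the 'SEARCH' example above)."""
    padded = pad + word + pad
    return [padded[i : i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("SEARCH"))
# -> ['_SE', 'SEA', 'EAR', 'ARC', 'RCH', 'CH_']
```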
Natural Language Processing: (when referring to search of text and written information) is the ability for the system/software to understand the meaning of search terms, and the content of systems to be searched, in terms of the language they are written in and the meaning of words and combinations of words. This requires a level of linguistic capability and morphology within the software to enable users to enter 'natural language' search strings (such as an English sentence) and be able to retrieve pertinent results – as opposed to key word searches. This enables users to use normal queries written in their own language and using standard sentence structure rather than having to 'translate' their desired query into machine-computable search components or key words. In more advanced scenarios this requires an understanding of (or the ability to clarify) the difference between what is said/written and what is/was intended.
Normalization: makes a text or language regular and consistent, especially with respect to spelling or style. This can include such techniques as stemming the text, converting the text to lowercase letters and removing accents. Normalization helps to ensure results are unambiguous and as intended even if searching across text in multiple languages.
Ontology: is the study of the categories of things within a domain, and defines the terms and concepts used to describe and represent an area of knowledge. Ontology focuses on what entities exist within the information set and how such entities can be grouped, related, and subdivided according to similarities, differences and relationships.
OWL (Web Ontology Language): “is intended to be used when the information contained in documents needs to be processed by applications, as opposed to situations where the content only needs to be presented to humans. OWL can be used to explicitly represent the meaning of terms in vocabularies and the relationships between those terms. This representation of terms and their interrelationships is called an ontology. OWL has more facilities for expressing meaning and semantics than XML, RDF, and RDF-S, and thus OWL goes beyond these languages in its ability to represent machine interpretable content on the Web.” For further information on OWL go to www.w3.org/TR/owl-features
Parametric Search: is a search with multiple parameters, e.g. documents in .pdf format written in 2009, relating to widget ‘x’, or people arrested for knife crimes in x county between January 2008 and January 2009. In most cases one or more parameters can be varied to produce different results. In some cases parameters are strictly defined and in others some flexibility is enabled to allow a greater number of resources to be located.
Parser/Tokeniser: defines each word as a part of speech e.g. verb, noun, adjective and can even define the more specific level of grammar e.g. past, present, future tense; first, second or third person; singular or plural. The parser then assigns a tag to each word reflecting the part of speech it represents and its grammatical function.
Preferred Terms: are terms used to represent a concept or category when indexing, and can be used to provide controlled vocabularies to deliver a level of control and consistency when deploying search applications.
Polyhierarchy: a hierarchy in which there is the ability for any given term to have multiple parents as opposed to a strict hierarchy (monohierarchy) where any given term only has one parent. This complexity has driven the demand for more capable search engines to assist with information location where the navigational structure has become complex.
Resource Description Framework (RDF) (see also OWL): A guideline from the W3C for creating meta-data structures that describe data on the Web. RDF is designed to provide a method for classification of data on Web sites in order to improve searching and navigation. For more on RDF go to www.w3.org/RDF.
Relational Database: a relational database is an electronic collection of records or items (of data) organized in structured tables. Data can be accessed or viewed, ‘computed’ and reassembled in many ways without having to reorganize the database tables. A database is managed by a Relational Database Management System (RDBMS).
Relevance: used by search engines to provide a ranking of results. Uses algorithms to present results and provide a level of ‘relevancy’ of each result to the search requested. This can assist the user in locating the most pertinent information faster, and can provide a logical order for results – especially when there are a significant number of potentially relevant pieces of information located.
Root: (in linguistics) is the primary lexical unit of a word. The root contains the core component(s) of the word’s meaning and cannot be reduced into smaller constituents without changing or losing the lexical and semantic value.
Rule-Based Taxonomy: (see Taxonomy) To simplify enterprise search deployments, the ability to configure a taxonomy of categories and category rules can assist. The taxonomy serves two purposes. First, when the search index is created, taxonomy categories are applied to documents based on whether a document satisfies the rule. Secondly, once categories are applied to documents, the taxonomy can be used to create a browsing interface to the collection. Many navigation-only solutions require a pre-defined taxonomy, whereas more advanced and flexible enterprise search solutions do not require a pre-defined taxonomy in order to deliver highly relevant search results. However, these more flexible solutions can often take advantage of taxonomy tags to influence both the results and interface of a search application.
Search Results Clustering: Collation of search results into categories (often created at the time of search and based on the results obtained).
Semantic Search: (this refers to ‘semantic search’ and NOT ‘semantic web’ which has come to mean something slightly different). Semantic search requires an understanding of the meaning within language(s) rather than just a word comparison (compare Key Word Search). Semantic search aims to deliver the most pertinent results to a user based on what they meant, rather than just the key words they typed into the search engine. A significant part of this is ‘disambiguation’ – understanding whether ‘Washington’ was intended as a place or a person’s name, whether ‘June’ is a month or a name, whether ‘Rob’ is an activity or a name, whether ‘Jaguar’ is a large cat or a car, whether ‘Polo’ is a mint, European car, a sport etc. Another significant aspect is the understanding of related items and concepts as well as the items within a category or group. For example if you search on “vehicle” a key word search will only locate items containing the word ‘vehicle’. A semantic search will enable results for all ‘types’ of vehicle as well e.g. lorry, car, van, pickup, train, motorbike, etc. and in comprehensive scenarios includes major brands/models such as Ford, GM, Opel, VW, Yamaha. The objective is to produce the most comprehensive set of results, and enable highly relevant search results based on multiple search criteria as well as subsequent navigation of the results. A key point of semantic search is to provide results which are pertinent to the user through understanding the ‘context’ of the search. Semantic search can also allow for common misspellings – whether in the content of the original document or in the search query entered. The use of Natural Language Processing enables and enhances semantic search capability.
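The 'vehicle' example can be sketched as query expansion over a tiny hand-made concept hierarchy (the hierarchy is illustrative; real semantic search derives such relationships from ontologies and linguistic analysis rather than a fixed table):

```python
# Map a broader concept to narrower terms it should also match.
NARROWER = {
    "vehicle": ["car", "van", "lorry", "train", "motorbike"],
}

def expand_query(term):
    """Expand a query term with its narrower concepts, if any."""
    return [term] + NARROWER.get(term, [])

print(expand_query("vehicle"))
# -> ['vehicle', 'car', 'van', 'lorry', 'train', 'motorbike']
```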
SMS Text Message Abbreviations (see also Abbreviations): abbreviations which started to come into common use in SMS text messages sent from mobile devices. Text message abbreviations are like a modern day short-hand based on enabling faster typing of messages using a small keypad with only a limited number of keys (rather than a full alpha-numeric keyboard). Text message abbreviations are commonly used in informal communication by instant message and now more frequently in e-mail. SMS text message abbreviations often contain phonetic abbreviations and substitutions of letters with numbers, for example; R (are), L8 (late), 2day (today). Abbreviations can be combined e.g. “c u l8r” (See you Later).
Social Tagging (see also Folksonomies): Users can save their own keywords (“tags”) for pages/items they found useful, to assist in locating them again in future based on the categorization/keyword(s) they added. The tags are shared across users. Users can also search on tags provided by other users.
Stemming: is the process of reducing inflected and/or derived terms to their stem, base or root form.
Word stemming is based on an algorithmic approach to define a language's rules of word structure and grammar. Every language differs in these respects and therefore a different algorithm is required for each language.
Word stemming can result in additional search results being found for any given query. Because stemming takes no account of the context of the search terms, nor of the part of speech they represent, it can return a significant number of results for similar words which have substantially different meanings, and can therefore retrieve a substantial amount of information which is not relevant/pertinent to the search (it increases the incidence of ‘false positives’).
In a search application, stemming provides the ability to include the “stem” of words in a query by removing prefixes and suffixes.
Lemmatization differs from ‘stemming’ primarily due to the understanding of language structure, grammar and the context of the word within the sentence. Results from lemmatization therefore tend to be more pertinent and more accurate than when using stemming.
|Surface Form|Simple Stemming|Lemmatization|
|---|---|---|
|organizing|organ|organize|
|organize|organ|organize|
|organ|organ|organ|
In general, lemmatization cannot be reliably substituted by stemming, which mechanically strips endings to create a stem that approximates a lemma. These stems are frequently non-words which have no linguistic or semantic meaning, or worse, provide completely different meaning and thereby negatively impact search quality. See the table above for an example of how lemmatization provides superior results by understanding the close relationship between “organizing” and “organize”, and not including “organ”, a more distant term.
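The contrast can be sketched in code. The suffix list and lemma dictionary below are hypothetical toys; real stemmers (such as the Porter algorithm) use far more elaborate rule sets, and real lemmatizers use full morphological dictionaries.

```python
# Contrast sketch: a naive suffix-stripping stemmer versus a (hypothetical,
# hand-built) lemma lookup. Real stemmers and lemmatizers are far richer.
SUFFIXES = ["izing", "izes", "ized", "ize", "ing", "es", "s"]
LEMMAS = {"organizing": "organize", "organizes": "organize", "organ": "organ"}

def stem(word):
    """Strip the first matching suffix; often yields a non-word stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def lemmatize(word):
    """Dictionary lookup returns a real word (the lemma)."""
    return LEMMAS.get(word, word)
```

Here `stem("organizing")` produces “organ”, conflating it with the unrelated word “organ”, while `lemmatize("organizing")` returns the real word “organize”.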
Stop Words / Stopwords: are words/terms which can be removed from a search string to improve the relevancy of the results. These can be generic or enterprise specific. Typically, stop words are commonly occurring words or phrases which, if included in the search query, might cause a large quantity of poor results. Once defined, stop words are ignored in a query. Each search engine defines its own stop words. Common stop words include prepositions (to, at, on, etc.) and articles (the, a). Some organisations may choose to add common industry terms as stop words when searching topic-specific databases, e.g. ‘policy’ when searching a ‘policy database’. Stop words need to be considered carefully in enterprise search situations where entity resolution is required, and in relationship analysis scenarios where prepositions can be valuable in defining the relationship or activity of an entity.
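Query-time stop word removal can be sketched in a few lines. The stop word list here is illustrative only; as noted above, each engine (and often each enterprise) defines its own.

```python
# Minimal sketch of query-time stop word removal; the stop word list is
# illustrative and would normally be engine- or enterprise-specific.
STOP_WORDS = {"the", "a", "an", "to", "at", "on", "of"}

def remove_stop_words(query):
    """Drop stop words from a query string before it reaches the index."""
    return [term for term in query.lower().split() if term not in STOP_WORDS]
```

A query such as “the policy on expenses” is reduced to the two content-bearing terms “policy” and “expenses”.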
Synonyms: words having the same (or almost the same) meaning as another, or words or expressions which can be substituted for one another. In search, being able to include synonyms automatically, without having to type all the alternative variants, increases the speed and productivity of searching, whilst providing greater access to information likely to be pertinent to the search. This functionality is particularly valuable in searching for complete spellings of common enterprise acronyms and the acronyms of common enterprise terms, e.g. CRM (Customer Relationship Management), VW (Volkswagen), ATV (All Terrain Vehicle), MS Office (Microsoft Office).
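The acronym case can be sketched as a bidirectional lookup table, so that a search for either form finds documents containing the other. The entries below are illustrative, not a real enterprise synonym list.

```python
# Hedged sketch: a bidirectional acronym table so a search for either form
# also matches the other. Entries are illustrative only.
ACRONYMS = {"crm": "customer relationship management", "vw": "volkswagen"}
EXPANSIONS = {full: abbr for abbr, full in ACRONYMS.items()}

def synonym_variants(term):
    """Return the term plus its known acronym or expanded form."""
    term = term.lower()
    variants = {term}
    if term in ACRONYMS:
        variants.add(ACRONYMS[term])
    if term in EXPANSIONS:
        variants.add(EXPANSIONS[term])
    return variants
```

In practice this table sits alongside a general thesaurus (see Thesaurus), and the variants are OR-ed into the query automatically.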
Taxonomy: a hierarchical structure used to describe and classify an item in relation to other items in the structure. For enterprise search applications, the taxonomy provides the hierarchical structure to aid the creation and storage of content in a manner which enables logical navigation of the content based on a structured set of classifications or ‘types’. Whereas in biology a given organism can only fit in one place within the taxonomy, in information taxonomies a given piece of information can logically fit into several taxonomic groups. Organisational and information management policies may define much of the taxonomic structure to maintain consistency for information storage and retrieval; however, the complexity of information and information types means that misplacing of items is common, and thus standard hierarchical, taxonomy-only navigational approaches can be time consuming and frustrating for those searching for information.
Text Segmentation: is required to deliver high precision for non-white-spaced languages, such as Chinese and Japanese. Segmentation is the process by which input text is broken down into distinct lexical units. This process includes some of the following linguistic processing capabilities: Contraction Splitting, Lemmatization, Abbreviation Recognition, End of Sentence Marker Recognition, and Word Segmentation.
Thesaurus: is a listing of words and alternative words or phrases with a similar or largely similar meaning to the initial word (e.g. synonyms). Used in some search applications to enable the automatic inclusion of listed variants and synonyms within a given search.
UIMA (Unstructured Information Management Architecture): is a framework for software systems that analyze large volumes of unstructured information in order to discover knowledge that is relevant to an end user. It was donated to the Apache Incubator, a project whose goal is “a thriving community of users and developers of UIMA frameworks, tools, and annotators, facilitating the analysis of unstructured content such as text, audio and video.” UIMA enables applications to be decomposed into components, for example "language identification" => "language specific segmentation" => "sentence boundary detection" => "entity detection (person/place names etc.)". Each component implements interfaces defined by the framework and provides self-describing metadata via XML descriptor files. The framework manages these components and the data flow between them. Components are written in Java or C++; the data that flows between components is designed for efficient mapping between these languages.
UIMA additionally provides capabilities to wrap components as network services, and can scale to very large volumes by replicating processing pipelines over a cluster of networked nodes.
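The component-pipeline idea can be illustrated with a toy sketch. To be clear, this is not the actual UIMA API (which is Java/C++ with XML descriptors); it only shows, in miniature, how annotator components read a shared analysis state and add metadata in sequence.

```python
# Not the actual UIMA API: a toy pipeline in the same spirit, where each
# "annotator" component reads the shared analysis state and adds metadata.
def detect_language(state):
    state["language"] = "en"  # stand-in for real language identification
    return state

def detect_sentences(state):
    # Crude sentence-boundary detection, purely for illustration.
    state["sentences"] = state["text"].split(". ")
    return state

def run_pipeline(text, components):
    """Run each component over the shared state, in order."""
    state = {"text": text}
    for component in components:
        state = component(state)
    return state
```

The framework's job, which this sketch omits, is managing component metadata, data flow, and (per the note above) distribution of such pipelines across networked nodes.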
For more information on UIMA go to: incubator.apache.org/uima
Unstructured Content: is any item which is not ‘structured data’ (which is data structured and confined within restricted and defined fields in a database). This is therefore the vast majority of information and includes free text fields in databases, text, documents, presentations, e-mail, pictures, images, graphics, print output, audio, video, and intranet and internet pages.
Variant Spellings: are alternative accepted spellings of the same word and are common in many languages. For effective search results, variant spellings need to be taken into consideration. For example, in German there are differences due to traditional and reformed spelling rules; in English, there are U.S. and U.K. variants; in Chinese, there are Traditional and Simplified versions; and in Japanese, you frequently see phonetic spelling variants.
Wildcard: is a defined symbol, usually an asterisk (*) or a question mark (?), that can be entered into a search query in place of one or more characters in a word or term. The effect is to search for any word or term which matches the pattern searched for, where the wildcard symbol can be matched by any letter (or symbol). E.g. searching for ‘Robert*’ may produce results including ‘Robert’, ‘Roberto’ and ‘Robertson’. Wildcards can often be included at the beginning, the end, or in the middle of a search term.
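One common way to implement wildcard matching is to translate the pattern into a regular expression. A minimal sketch, assuming the usual convention that `*` matches any run of characters and `?` matches exactly one:

```python
import re

# Sketch: translate search wildcards into a regular expression, where '*'
# matches any run of characters and '?' matches exactly one character.
def wildcard_to_regex(pattern):
    parts = (
        ".*" if ch == "*" else "." if ch == "?" else re.escape(ch)
        for ch in pattern
    )
    return re.compile("^" + "".join(parts) + "$", re.IGNORECASE)

matcher = wildcard_to_regex("Robert*")
```

With this, `matcher` accepts ‘Robert’, ‘Roberto’ and ‘Robertson’ but rejects ‘Rob’. (Leading wildcards, while expressible, are typically expensive for index-based engines, which is why not all products support them.)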
Word Normalization: (often called Lemmatization) is important for any language which inflects, but it is absolutely critical to achieve reasonable search results/recall for highly inflected languages, such as Russian. See Lemmatization.
Word Segmentation: enables the breakdown of words into their semantic units and is important for compounding languages, such as German. Word segmentation is also used for languages that do not use white spaces (or delimiters) between words, such as Japanese and Chinese. By breaking down the words into their semantic units search results are enhanced by being able to understand better the content and meaning of text.
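A common baseline technique for this is greedy longest-match segmentation against a lexicon. The sketch below uses a tiny hypothetical lexicon and an English compound purely for illustration; production segmenters for Chinese, Japanese or German compounds use much larger dictionaries and statistical models.

```python
# Illustrative greedy longest-match segmentation over a tiny hypothetical
# lexicon; a common baseline for unsegmented or compounding languages.
LEXICON = {"motor", "way", "motorway", "side"}

def segment(text):
    """Split text into the longest lexicon words, left to right."""
    words, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest match first
            if text[i:j] in LEXICON:
                words.append(text[i:j])
                i = j
                break
        else:
            words.append(text[i])  # unknown character: emit it alone
            i += 1
    return words
```

Preferring the longest match is what keeps “motorway” from being split into “motor” + “way”; getting such choices right is exactly what improves the engine's understanding of the content and meaning of the text.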