
Elasticsearch Explained: Full-Text Search for Beginners

Elasticsearch is a distributed search engine built on top of Apache Lucene that enables fast, relevant full-text search across massive datasets. Unlike traditional SQL databases, it's optimized for speed and relevance scoring, making it the backbone of search features in apps like Netflix, Slack, and GitHub.

What Is Elasticsearch and Why Does It Matter?

Most databases excel at exact matches—finding a user by ID or filtering rows by status. But searching for "best Italian restaurants near me" across millions of reviews? That's where traditional databases struggle. Elasticsearch was built for exactly this problem.

At its core, Elasticsearch works differently than a SQL database. Instead of storing data in rows and columns, it stores documents as JSON objects organized into indexes. Each index is like a table, but with built-in text analysis, relevance ranking, and lightning-fast retrieval.

Here's why it matters: search speed. A SQL query scanning millions of rows might take seconds. Elasticsearch can find relevant results in milliseconds, even across billions of documents. This speed comes from inverted indexes—a data structure that maps words to the documents containing them.

Understanding Inverted Indexes

The secret to Elasticsearch's speed is the inverted index. While a normal index maps documents to their contents, an inverted index maps contents (words) to documents containing them.

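Imagine three product descriptions, Doc 1, Doc 2, and Doc 3, in which "blue" appears in Doc 1 and Doc 3 and "running" appears in Doc 1 and Doc 2.

A traditional index would list each document and its words. An inverted index reverses this, listing each word alongside the documents that contain it.

When you search for "blue running," Elasticsearch looks up "blue" → [Doc 1, Doc 3] and "running" → [Doc 1, Doc 2], then ranks Doc 1 first because it contains both terms. Each lookup goes straight to a term's document list instead of scanning every document, so it stays fast as the dataset grows.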
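The lookup described above can be sketched in a few lines of Python. The three descriptions are hypothetical, chosen only so that "blue" lands in Doc 1 and Doc 3 and "running" in Doc 1 and Doc 2; the function names are illustrative, and a real Lucene index stores far more than a word-to-ID map.

```python
from collections import defaultdict

# Hypothetical product descriptions (not from the article).
docs = {
    1: "blue running shoes",
    2: "red running shorts",
    3: "blue denim jacket",
}

def build_inverted_index(documents):
    """Map each lowercase word to the sorted list of doc IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

def docs_with_all_terms(index, query):
    """Intersect the posting lists of every word in the query."""
    postings = [set(index.get(word, [])) for word in query.lower().split()]
    if not postings:
        return []
    return sorted(set.intersection(*postings))

inverted = build_inverted_index(docs)
print(inverted["blue"])                               # [1, 3]
print(inverted["running"])                            # [1, 2]
print(docs_with_all_terms(inverted, "blue running"))  # [1]
```

The search never touches the document texts: it only reads two short lists and intersects them.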

Core Concepts: Clusters, Nodes, and Shards

Elasticsearch is designed for scale. Understanding its architecture helps you grasp why it's reliable and fast.

Nodes

A node is a single Elasticsearch server instance. Each node holds a portion of your data and processes requests. You can run multiple nodes on different machines to distribute load.

Cluster

A cluster is a group of nodes working together. They share data, coordinate searches, and automatically handle failover if a node goes down. A single-node cluster is valid for development but risky for production.

Shards

An index is divided into shards: smaller chunks distributed across nodes. If an index has 3 shards, each shard ideally lives on a different node. This parallelizes search: Elasticsearch queries all shards simultaneously, then merges the results, which keeps retrieval fast at scale.
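As a rough illustration of the query-then-merge flow, the sketch below spreads a few hypothetical documents over 3 in-memory "shards" and merges each shard's best hits. Everything here is an assumption for illustration: `zlib.crc32` is a stand-in for the hash Elasticsearch actually uses, the score is a plain word count, and the shards are visited in a loop where a real cluster would query them in parallel.

```python
import heapq
import zlib

NUM_SHARDS = 3  # matches the 3-shard example above

def shard_for(doc_id, num_shards=NUM_SHARDS):
    """Pick a shard from a stable hash of the document ID (stand-in hash)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards

def search_shard(shard_docs, term, k):
    """Score = occurrences of the term; return this shard's top-k hits."""
    hits = []
    for doc_id, text in shard_docs.items():
        score = text.lower().split().count(term)
        if score > 0:
            hits.append((score, doc_id))
    return heapq.nlargest(k, hits)

def search(shards, term, k=2):
    """Ask every shard for its top-k, then merge into a global top-k."""
    partial = []
    for shard_docs in shards:  # sequential here; parallel in a real cluster
        partial.extend(search_shard(shard_docs, term, k))
    return heapq.nlargest(k, partial)

# Hypothetical documents, routed to shards by ID.
docs = {
    "a": "running shoes for trail running",
    "b": "running shorts",
    "c": "hiking boots",
    "d": "running running running socks",
}
shards = [{} for _ in range(NUM_SHARDS)]
for doc_id, text in docs.items():
    shards[shard_for(doc_id)][doc_id] = text

print(search(shards, "running"))  # [(3, 'd'), (2, 'a')]
```

Because every shard returns its own top-k, the merged result is the same no matter how the documents happen to be distributed.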

Replicas

Replicas are copies of shards for reliability. If a node fails, replicas ensure data isn't lost. They also boost search performance since queries can hit replicas instead of primary shards.

Creating and Managing Indexes

Let's get practical. First, you'll create an index: the container for your searchable data. You interact with Elasticsearch through its REST API.

To create an index called "products":

curl -X PUT "localhost:9200/products" \
  -H "Content-Type: application/json" \
  -d '{
    "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 1
    },
    "mappings": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "standard"
        },
        "price": {
          "type": "float"
        },
        "category": {
          "type": "keyword"
        }
      }
    }
  }'

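Let's break this down. The settings define 1 shard and 1 replica. The mappings define your document structure. On a single-node cluster the replica has no second node to live on, so the index will report yellow health; set "number_of_replicas" to 0 for local development if that bothers you.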

Notice the "text" type for "name"—this tells Elasticsearch to tokenize and analyze the field for full-text search. The "keyword" type for "category" keeps it intact for exact matching (no tokenization). Price is a float, so Elasticsearch knows to handle it numerically.
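To make the text versus keyword distinction concrete, here is a toy Python sketch. It assumes a drastically simplified analyzer (lowercase, split on whitespace) and hypothetical function names; it only shows why a single word can match a text field but not a keyword field.

```python
def index_text(value):
    """Analyzed field: lowercase and split into individual terms."""
    return value.lower().split()

def index_keyword(value):
    """Keyword field: the whole value is stored as one term, unchanged."""
    return [value]

name_terms = index_text("Lightweight Running Shoes")
category_terms = index_keyword("Outdoor Footwear")

print("running" in name_terms)               # True: one word matches
print("footwear" in category_terms)          # False: partial value, wrong case
print("Outdoor Footwear" in category_terms)  # True: exact value only
```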

Indexing Documents

Now add documents (data) to your index:

curl -X POST "localhost:9200/products/_doc" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "lightweight running shoes",
    "price": 89.99,
    "category": "footwear"
  }'

Elasticsearch assigns an ID and indexes the document. The "_doc" endpoint adds a document to the index. Want to specify an ID?

curl -X POST "localhost:9200/products/_doc/1" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "premium hiking boots",
    "price": 149.99,
    "category": "footwear"
  }'

That document gets ID "1". Add a few more, and you're ready to search.
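If you would rather script this than type curl commands, the same request can be built with Python's standard library. This is a minimal sketch: `BASE_URL` assumes the same local node as the curl examples, the helper names are made up for illustration, and `send()` is defined but not called so the snippet runs without a server.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: local node, as in the curl examples

def build_index_request(index, document, doc_id=None):
    """Build (but do not send) the HTTP request that indexes one document."""
    url = f"{BASE_URL}/{index}/_doc"
    if doc_id is not None:
        url += f"/{doc_id}"
    return urllib.request.Request(
        url,
        data=json.dumps(document).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def send(request):
    """Send the request and parse the JSON reply (needs a running node)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)

req = build_index_request(
    "products",
    {"name": "premium hiking boots", "price": 149.99, "category": "footwear"},
    doc_id=1,
)
print(req.get_method(), req.full_url)
# POST http://localhost:9200/products/_doc/1
```

With a node running, `send(req)` would perform the same call as the second curl command above.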

Full-Text Search Queries

Searching is where Elasticsearch shines. Use the match query for full-text search:

curl -X GET "localhost:9200/products/_search" \
  -H "Content-Type: application/json" \
  -d '{
    "query": {
      "match": {
        "name": "running shoes"
      }
    }
  }'

Elasticsearch tokenizes "running shoes" into "running" and "shoes," then finds documents matching either term. Documents matching both rank higher (relevance scoring). Each hit comes back with a "_score" field, and higher scores mean better matches.

For exact phrase matching, use match_phrase:

curl -X GET "localhost:9200/products/_search" \
  -H "Content-Type: application/json" \
  -d '{
    "query": {
      "match_phrase": {
        "name": "running shoes"
      }
    }
  }'

This returns only documents containing the exact phrase "running shoes," in that order.
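The difference between the two query types is easy to see in a toy model. The sketch below assumes a simplified analyzer (lowercase, split on whitespace) and hypothetical document texts; it mimics the matching rules only, not Elasticsearch's scoring.

```python
def tokens(text):
    return text.lower().split()

def match(doc, query):
    """True if the document contains at least one query term."""
    doc_terms = set(tokens(doc))
    return any(term in doc_terms for term in tokens(query))

def match_phrase(doc, query):
    """True if the query terms appear next to each other, in order."""
    d, q = tokens(doc), tokens(query)
    if not q:
        return False
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))

docs = [
    "lightweight running shoes",     # both terms, adjacent and in order
    "shoes for running and hiking",  # both terms, wrong order
    "premium hiking boots",          # neither term
]
for doc in docs:
    print(doc, "|", match(doc, "running shoes"), match_phrase(doc, "running shoes"))
# lightweight running shoes | True True
# shoes for running and hiking | True False
# premium hiking boots | False False
```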

For filtering by exact values (like category), use a term query or bool query:

curl -X GET "localhost:9200/products/_search" \
  -H "Content-Type: application/json" \
  -d '{
    "query": {
      "bool": {
        "must": [
          { "match": { "name": "shoes" } }
        ],
        "filter": [
          { "term": { "category": "footwear" } }
        ]
      }
    }
  }'

The bool query combines conditions. "Must" clauses have to match and contribute to the relevance score. "Filter" clauses only include or exclude documents and never affect scores. That makes them cheap, and Elasticsearch can cache frequently used filters.
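Here is a small in-memory sketch of that split between scoring and filtering. The documents and the scoring rule (share of words equal to the term) are invented for illustration and are not how Elasticsearch computes relevance.

```python
# Hypothetical documents.
docs = [
    {"name": "lightweight running shoes", "category": "footwear"},
    {"name": "casual shoes", "category": "footwear"},
    {"name": "shoes storage rack", "category": "furniture"},
    {"name": "premium hiking boots", "category": "footwear"},
]

def must_score(doc, term):
    """Toy relevance: share of the name's words that equal the term."""
    words = doc["name"].lower().split()
    return words.count(term) / len(words)

def bool_search(documents, must_term, filter_category):
    hits = []
    for doc in documents:
        # Filter clause: a yes/no check that never changes the score.
        if doc["category"] != filter_category:
            continue
        # Must clause: the document has to match, and the match sets the score.
        score = must_score(doc, must_term)
        if score > 0:
            hits.append((score, doc["name"]))
    return sorted(hits, reverse=True)

for score, name in bool_search(docs, "shoes", "footwear"):
    print(f"{score:.2f}  {name}")
# 0.50  casual shoes
# 0.33  lightweight running shoes
```

The furniture item mentions "shoes" but is dropped by the filter, and the boots pass the filter but fail the must clause.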

Relevance Scoring and Ranking

Elasticsearch ranks results with BM25 by default (since version 5.0), a refinement of the older TF-IDF (Term Frequency-Inverse Document Frequency) model. Both rest on the same two ideas:

Term Frequency (TF): How often does a search term appear in a document? More occurrences = higher relevance.

Inverse Document Frequency (IDF): How rare is the term across all documents? Rare terms boost relevance more than common ones. "Shoe" is common in a product index, but "waterproof" is rarer, so finding "waterproof" matters more.

BM25 combines the two for each query term, dampens the effect of repeated occurrences, and adjusts for document length so that long documents don't win on word count alone. Elasticsearch sums these per-term values into a score for each document, then ranks by score. You can see scores in search results:

{
  "hits": {
    "total": { "value": 2, "relation": "eq" },
    "hits": [
      {
        "_id": "1",
        "_score": 2.5,
        "_source": { "name": "premium hiking boots" }
      },
      {
        "_id": "2",
        "_score": 1.1,
        "_source": { "name": "casual shoes" }
      }
    ]
  }
}

Document 1 scored 2.5, Document 2 scored 1.1. Elasticsearch returns them in score order—highest first.
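To see how term frequency and inverse document frequency turn into a ranking, here is a sketch of the textbook BM25 formula, the TF/IDF refinement that recent Elasticsearch versions use by default. The three documents are hypothetical, k1 = 1.2 and b = 0.75 are the usual default parameters, and Lucene's actual implementation differs in details, so the numbers will not match real "_score" values.

```python
import math

K1 = 1.2   # how quickly repeated terms stop adding score
B = 0.75   # how strongly long documents are penalized

# Hypothetical product names.
docs = {
    1: "waterproof hiking shoe",
    2: "casual shoe",
    3: "running shoe with cushioned shoe insert",
}
tokenized = {doc_id: text.split() for doc_id, text in docs.items()}
avg_len = sum(len(words) for words in tokenized.values()) / len(tokenized)

def idf(term):
    """Rare terms get a larger weight than common ones."""
    n = sum(1 for words in tokenized.values() if term in words)
    total = len(tokenized)
    return math.log(1 + (total - n + 0.5) / (n + 0.5))

def bm25(doc_id, query):
    """Sum the per-term BM25 contributions for one document."""
    words = tokenized[doc_id]
    length_norm = K1 * (1 - B + B * len(words) / avg_len)
    score = 0.0
    for term in query.split():
        tf = words.count(term)
        score += idf(term) * tf * (K1 + 1) / (tf + length_norm)
    return score

ranked = sorted(docs, key=lambda d: bm25(d, "waterproof shoe"), reverse=True)
print(ranked)  # [1, 2, 3]
```

Document 1 wins because it contains the rare term "waterproof". Document 3 mentions "shoe" twice but still ranks below the shorter Document 2, which is the length adjustment at work.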

Analyzers and Text Processing

Before indexing, Elasticsearch processes text through an analyzer. The default "standard" analyzer always performs the first two steps below; the third is available but turned off by default:

  1. Tokenization: Splits text into words ("running shoes" → ["running", "shoes"])
  2. Lowercasing: Converts to lowercase ("Running" → "running")
  3. Stop word removal: Removes common words like "the," "a," "is" (optional, disabled unless you configure it)

This is why searching "running shoes" finds documents with "Running Shoes" or "RUNNING SHOES"—case doesn't matter anymore.
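The three steps can be mimicked in a few lines of Python. This is a simplified stand-in, not the real standard analyzer: the regular expression is a crude tokenizer, and the stop word set contains only the three examples named above.

```python
import re

STOP_WORDS = {"the", "a", "is"}  # only the examples from the list above

def analyze(text, remove_stop_words=False):
    tokens = re.findall(r"\w+", text)      # 1. tokenization (simplified)
    tokens = [t.lower() for t in tokens]   # 2. lowercasing
    if remove_stop_words:                  # 3. optional stop word removal
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens

print(analyze("RUNNING Shoes"))
# ['running', 'shoes']
print(analyze("The shoe is a classic", remove_stop_words=True))
# ['shoe', 'classic']
```

Since the same function runs on both the indexed text and the query, "RUNNING SHOES" and "running shoes" end up as identical terms.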

You can customize analyzers or use built-in ones like "english" (which removes English stop words and applies stemming—converting "running" and "runs" to a common root).

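Define a custom analyzer in the index settings, then reference it from your mapping. Because this request creates the index, it fails if "products" already exists, so delete the earlier index first or pick a new name: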

curl -X PUT "localhost:9200/products" \
  -H "Content-Type: application/json" \
  -d '{
    "settings": {
      "analysis": {
        "analyzer": {
          "custom_analyzer": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["lowercase", "stop"]
          }
        }
      }
    },
    "mappings": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "custom_analyzer"
        }
      }
    }
  }'

Common Pitfalls and Best Practices

Don't confuse text and keyword types. Text fields are analyzed for search; keyword fields aren't. Use text for searchable content, keyword for exact matches (IDs, tags, status).

Monitor shard size. Aim for shards between 10–50GB. Too many small shards hurt performance; too few large shards limit parallelization.
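The 10–50GB guideline translates into simple arithmetic. The helper below is a hypothetical back-of-the-envelope calculator, not an Elasticsearch API; it assumes you already have an estimate of the index's total size.

```python
import math

def primary_shard_range(index_size_gb, min_shard_gb=10, max_shard_gb=50):
    """Return (fewest, most) primary shards that keep each shard in range."""
    fewest = max(1, math.ceil(index_size_gb / max_shard_gb))
    most = max(1, math.floor(index_size_gb / min_shard_gb))
    return fewest, max(fewest, most)

print(primary_shard_range(600))  # (12, 60)
print(primary_shard_range(5))    # (1, 1)
```

A 600GB index fits the guideline with anywhere from 12 to 60 primary shards, while a 5GB index needs only one.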

Use filters, not queries, for non-scoring operations. Filter clauses skip score calculation and can be cached, so they are usually faster. Reserve scoring queries for relevance-sensitive searches.

Index strategically. Don't index huge text fields you'll never search. Analyzers add overhead—only apply them where needed.

Test with realistic data. Performance characteristics change with scale. Benchmark with production-like datasets before deploying.

When to Use Elasticsearch

Elasticsearch excels at full-text search, log analysis, analytics dashboards, and autocomplete. It's overkill for simple exact-match queries your SQL database handles fine.

Good use cases: e-commerce product search, content management system queries, application logs, real-time analytics. Bad use cases: transactional data with strict consistency requirements, simple CRUD operations.

One more thing: Elasticsearch is resource-heavy. It consumes significant RAM and CPU. For small projects, simpler solutions (PostgreSQL with full-text search or SQLite FTS) might suffice. Start simple, and migrate to Elasticsearch when you hit performance walls.

Getting Started

Install Elasticsearch locally using Docker:

docker run -d --