POSTGRESQL: mudanças entre as edições
(Criou página com '=PostgreSQL= <pre> \l /dt create database zproxy create user zproxy with password 'xpto'; grant all privileges on database zproxy to zproxy; \connect zproxy \password zproxy <...') |
Sem resumo de edição |
||
| Linha 1: | Linha 1: | ||
=PostgreSQL= | = PostgreSQL Full-Text Search: Concepts, SQL Usage, and Integration with Java and Python = | ||
< | |||
PostgreSQL provides a powerful, native Full-Text Search (FTS) engine designed to efficiently index and search large volumes of textual data. Unlike simple string matching using LIKE or ILIKE, PostgreSQL Full-Text Search performs linguistic analysis, tokenization, stemming, stop-word removal, and relevance ranking. | |||
/ | |||
This article starts by introducing Full-Text Search directly at the SQL level, then explains its internal concepts, and finally demonstrates recommended, production-grade usage from Java and Python applications using Spring Data JPA and Peewee. | |||
== Getting Started with PostgreSQL Full-Text Search (Pure SQL) == | |||
PostgreSQL Full-Text Search is built into the database and can be used without any external libraries or frameworks. | |||
</ | |||
=== A Basic Example === | |||
Consider a simple table storing articles: | |||
<syntaxhighlight lang="sql"> | |||
CREATE TABLE article ( | |||
id SERIAL PRIMARY KEY, | |||
title TEXT, | |||
body TEXT | |||
); | |||
</syntaxhighlight> | |||
A basic full-text search can be performed by converting text into a <code>tsvector</code> and comparing it with a <code>tsquery</code>: | |||
<syntaxhighlight lang="sql"> | |||
SELECT * | |||
FROM article | |||
WHERE to_tsvector('portuguese', title || ' ' || body) | |||
@@ plainto_tsquery('portuguese', 'postgres database'); | |||
</syntaxhighlight> | |||
This query finds rows whose text semantically matches the words “postgres” and “database”, regardless of word order, case, or inflection. | |||
=== Why Not LIKE? === | |||
Traditional SQL searches often rely on: | |||
<syntaxhighlight lang="sql"> | |||
WHERE body ILIKE '%postgres%' | |||
</syntaxhighlight> | |||
This approach: | |||
* Cannot use linguistic analysis | |||
* Does not scale well on large datasets | |||
* Fails to rank results by relevance | |||
PostgreSQL Full-Text Search solves these limitations. | |||
=== Using plainto_tsquery === | |||
<code>plainto_tsquery</code> is designed for user input. It automatically: | |||
* Splits text into terms | |||
* Applies the language dictionary | |||
* Combines terms using logical AND | |||
<syntaxhighlight lang="sql"> | |||
SELECT * | |||
FROM article | |||
WHERE to_tsvector('portuguese', body) | |||
@@ plainto_tsquery('portuguese', 'powerful database'); | |||
</syntaxhighlight> | |||
=== Using to_tsquery (Advanced Syntax) === | |||
When the query syntax is controlled, <code>to_tsquery</code> allows explicit logical operators: | |||
<syntaxhighlight lang="sql"> | |||
SELECT * | |||
FROM article | |||
WHERE to_tsvector('portuguese', body) | |||
@@ to_tsquery('portuguese', 'postgres & (database | data)'); | |||
</syntaxhighlight> | |||
Prefix matching is also supported: | |||
<syntaxhighlight lang="sql"> | |||
WHERE to_tsvector('portuguese', body) | |||
@@ to_tsquery('portuguese', 'postgre:*'); | |||
</syntaxhighlight> | |||
== Core Concepts of PostgreSQL Full-Text Search == | |||
PostgreSQL Full-Text Search relies on two specialized data types: | |||
; tsvector | |||
: A normalized and indexed representation of a document. Text is tokenized, lowercased, stripped of stop words, and stemmed according to a language dictionary. | |||
; tsquery | |||
: A structured search query containing terms and logical operators such as AND (&), OR (|), NOT (!), and prefix matching (:*). | |||
=== Example Transformation === | |||
The text: | |||
<code>PostgreSQL is a powerful database</code> | |||
may be transformed into a <code>tsvector</code> similar to: | |||
<code>'databas':4 'power':3 'postgresql':1</code> | |||
A user query such as: | |||
<code>powerful database</code> | |||
becomes a <code>tsquery</code> equivalent to: | |||
<code>'power' & 'databas'</code> | |||
== The @@ Operator == | |||
The <code>@@</code> operator is the central operator of PostgreSQL Full-Text Search: | |||
<syntaxhighlight lang="sql"> | |||
tsvector @@ tsquery | |||
</syntaxhighlight> | |||
It evaluates whether a document matches a full-text query and returns a boolean result. | |||
Conceptually, it can be read as: | |||
<blockquote> | |||
"Does this indexed document match this search query?" | |||
</blockquote> | |||
The <code>@@</code> operator: | |||
* Executes the full-text match | |||
* Enables the use of GIN or GiST indexes | |||
* Is required for ranking and highlighting | |||
* Has no equivalent in standard SQL or JPQL | |||
Without <code>@@</code>, PostgreSQL Full-Text Search cannot function. | |||
== Indexing with GIN == | |||
For performance, PostgreSQL Full-Text Search relies on inverted indexes, most commonly GIN (Generalized Inverted Index). | |||
A recommended approach is to store a dedicated <code>tsvector</code> column and index it: | |||
<syntaxhighlight lang="sql"> | |||
ALTER TABLE article | |||
ADD COLUMN fts tsvector | |||
GENERATED ALWAYS AS ( | |||
setweight(to_tsvector('portuguese', coalesce(title,'')), 'A') || | |||
setweight(to_tsvector('portuguese', coalesce(body,'')), 'B') | |||
) STORED; | |||
CREATE INDEX idx_article_fts | |||
ON article USING GIN (fts); | |||
</syntaxhighlight> | |||
This approach avoids triggers and guarantees consistency. | |||
== Ranking and Highlighting == | |||
PostgreSQL provides ranking functions such as <code>ts_rank</code> and <code>ts_rank_cd</code> to score documents by relevance. | |||
Highlighting is handled by <code>ts_headline</code>, which extracts and emphasizes relevant fragments from the original text. | |||
== Using PostgreSQL Full-Text Search from Java (Spring Data JPA) == | |||
Spring Data JPA does not support PostgreSQL Full-Text Search operators directly in JPQL. The recommended approach is to use native SQL queries while keeping entity management and CRUD operations within JPA. | |||
=== Entity Mapping === | |||
<syntaxhighlight lang="java"> | |||
@Entity | |||
@Table(name = "article") | |||
public class Article { | |||
@Id | |||
@GeneratedValue(strategy = GenerationType.IDENTITY) | |||
private Long id; | |||
private String title; | |||
@Column(columnDefinition = "text") | |||
private String body; | |||
@Column(name = "fts", insertable = false, updatable = false) | |||
private String fts; | |||
} | |||
</syntaxhighlight> | |||
=== Repository with Full-Text Search === | |||
<syntaxhighlight lang="java"> | |||
public interface ArticleRepository extends JpaRepository<Article, Long> { | |||
@Query(value = """ | |||
SELECT a.* | |||
FROM article a | |||
WHERE a.fts @@ plainto_tsquery('portuguese', :query) | |||
ORDER BY ts_rank(a.fts, plainto_tsquery('portuguese', :query)) DESC | |||
""", nativeQuery = true) | |||
List<Article> search(@Param("query") String query); | |||
} | |||
</syntaxhighlight> | |||
== Using PostgreSQL Full-Text Search from Python (Peewee) == | |||
Peewee is a lightweight ORM that allows direct access to PostgreSQL operators and functions, making it well-suited for Full-Text Search when used with explicit SQL expressions. | |||
=== Model Definition === | |||
<syntaxhighlight lang="python"> | |||
from peewee import * | |||
from playhouse.postgres_ext import PostgresqlExtDatabase | |||
db = PostgresqlExtDatabase("example_db") | |||
class BaseModel(Model): | |||
class Meta: | |||
database = db | |||
class Article(BaseModel): | |||
title = TextField() | |||
body = TextField() | |||
fts = TextField(null=True) | |||
</syntaxhighlight> | |||
=== Full-Text Search Query === | |||
<syntaxhighlight lang="python"> | |||
from peewee import fn, SQL | |||
query_text = "postgres database" | |||
query = ( | |||
Article | |||
.select( | |||
Article, | |||
fn.ts_rank( | |||
Article.fts, | |||
fn.plainto_tsquery('portuguese', query_text) | |||
).alias('rank') | |||
) | |||
.where( | |||
Article.fts.op('@@')( | |||
fn.plainto_tsquery('portuguese', query_text) | |||
) | |||
) | |||
.order_by(SQL('rank').desc()) | |||
) | |||
for article in query: | |||
print(article.title, article.rank) | |||
</syntaxhighlight> | |||
== ORM vs Search Engines == | |||
PostgreSQL Full-Text Search is sufficient for most applications, including: | |||
* Blog and article search | |||
* Documentation systems | |||
* Administrative and internal tools | |||
* Moderate autocomplete requirements | |||
Dedicated search engines such as Elasticsearch or OpenSearch are recommended only when requirements include advanced linguistic analysis, distributed indexing, or large-scale search analytics. | |||
== Best Practices == | |||
* Always use a dedicated <code>tsvector</code> column | |||
* Always create a GIN index on the <code>tsvector</code> | |||
* Prefer <code>plainto_tsquery</code> or <code>websearch_to_tsquery</code> for user input | |||
* Use <code>to_tsquery</code> only when the query syntax is controlled | |||
* Keep Full-Text Search logic inside PostgreSQL | |||
* Use ORM abstractions for CRUD and native SQL for search | |||
== Conclusion == | |||
PostgreSQL Full-Text Search is a mature, efficient, and production-ready solution that integrates naturally with both Java and Python ecosystems. | |||
By starting with SQL fundamentals and layering ORM integration on top, applications can achieve high-performance, semantically rich search without the operational complexity of external search engines. | |||
== Web-Style Search with websearch_to_tsquery == | |||
PostgreSQL provides <code>websearch_to_tsquery</code>, a function designed to accept search strings written in a format similar to modern web search engines (such as Google). | |||
Unlike <code>to_tsquery</code>, which requires explicit operators, and <code>plainto_tsquery</code>, which always combines terms with AND, <code>websearch_to_tsquery</code> supports a more expressive and user-friendly syntax. | |||
=== Supported Syntax === | |||
<code>websearch_to_tsquery</code> understands: | |||
* Space-separated terms (implicit AND) | |||
* OR using the keyword <code>OR</code> | |||
* Quoted phrases | |||
* Unary minus for negation | |||
=== Examples === | |||
Basic search (implicit AND): | |||
<syntaxhighlight lang="sql"> | |||
SELECT * | |||
FROM article | |||
WHERE fts @@ websearch_to_tsquery('portuguese', 'postgres database'); | |||
</syntaxhighlight> | |||
OR operator: | |||
<syntaxhighlight lang="sql"> | |||
WHERE fts @@ websearch_to_tsquery('portuguese', 'postgres OR mysql'); | |||
</syntaxhighlight> | |||
Quoted phrase: | |||
<syntaxhighlight lang="sql"> | |||
WHERE fts @@ websearch_to_tsquery('portuguese', '"full text search"'); | |||
</syntaxhighlight> | |||
Negation: | |||
<syntaxhighlight lang="sql"> | |||
WHERE fts @@ websearch_to_tsquery('portuguese', 'postgres -mysql'); | |||
</syntaxhighlight> | |||
=== Why websearch_to_tsquery Matters === | |||
This function is particularly well-suited for direct user input because: | |||
* It avoids syntax errors common with <code>to_tsquery</code> | |||
* It feels natural to end users | |||
* It reduces the need for custom query parsing in the application layer | |||
For most user-facing search boxes, <code>websearch_to_tsquery</code> is the recommended choice. | |||
== FTS vs pg_trgm == | |||
PostgreSQL offers two powerful but fundamentally different approaches to text search: Full-Text Search (FTS) and trigram similarity via the <code>pg_trgm</code> extension. | |||
Understanding their differences is essential for choosing the correct tool. | |||
=== pg_trgm Overview === | |||
The <code>pg_trgm</code> extension indexes text by breaking it into overlapping sequences of three characters (trigrams). It enables fast similarity searches and partial matching. | |||
To enable it: | |||
<syntaxhighlight lang="sql"> | |||
CREATE EXTENSION IF NOT EXISTS pg_trgm; | |||
</syntaxhighlight> | |||
A typical trigram index: | |||
<syntaxhighlight lang="sql"> | |||
CREATE INDEX idx_article_trgm | |||
ON article | |||
USING GIN (body gin_trgm_ops); | |||
</syntaxhighlight> | |||
Example query: | |||
<syntaxhighlight lang="sql"> | |||
SELECT * | |||
FROM article | |||
WHERE body ILIKE '%postgre%'; | |||
</syntaxhighlight> | |||
=== Key Differences === | |||
{| class="wikitable" | |||
! Feature !! PostgreSQL FTS (@@) !! pg_trgm | |||
|- | |||
| Linguistic analysis || Yes (stemming, stop words) || No | |||
|- | |||
| Word-based search || Yes || No | |||
|- | |||
| Partial word matching || Limited (:*) || Yes | |||
|- | |||
| Typo tolerance || Low || High | |||
|- | |||
| Ranking by relevance || Yes (ts_rank) || Limited (similarity) | |||
|- | |||
| Language awareness || Yes || No | |||
|- | |||
| Index size || Moderate || Larger | |||
|- | |||
| Best use case || Semantic text search || Fuzzy / partial matching | |||
|} | |||
=== When to Use Full-Text Search == | |||
Use PostgreSQL Full-Text Search when: | |||
* Searching long documents or articles | |||
* Language semantics matter | |||
* You need relevance ranking | |||
* Queries are word-based | |||
* You want deterministic, explainable results | |||
=== When to Use pg_trgm == | |||
Use <code>pg_trgm</code> when: | |||
* Implementing autocomplete | |||
* Supporting typo-tolerant search | |||
* Matching substrings or partial words | |||
* Searching identifiers, codes, or names | |||
=== Combining FTS and pg_trgm === | |||
In many real-world systems, the best solution is to combine both approaches. | |||
Example: | |||
* Use FTS (<code>@@</code>) for main content search | |||
* Use <code>pg_trgm</code> for autocomplete or fallback fuzzy matching | |||
* Merge or rank results at the application layer | |||
This hybrid approach often delivers search quality comparable to dedicated search engines while keeping all logic inside PostgreSQL. | |||
Edição atual tal como às 18h13min de 22 de dezembro de 2025
PostgreSQL Full-Text Search: Concepts, SQL Usage, and Integration with Java and Python
PostgreSQL provides a powerful, native Full-Text Search (FTS) engine designed to efficiently index and search large volumes of textual data. Unlike simple string matching using LIKE or ILIKE, PostgreSQL Full-Text Search performs linguistic analysis, tokenization, stemming, stop-word removal, and relevance ranking.
This article starts by introducing Full-Text Search directly at the SQL level, then explains its internal concepts, and finally demonstrates recommended, production-grade usage from Java and Python applications using Spring Data JPA and Peewee.
Getting Started with PostgreSQL Full-Text Search (Pure SQL)
PostgreSQL Full-Text Search is built into the database and can be used without any external libraries or frameworks.
A Basic Example
Consider a simple table storing articles:
CREATE TABLE article (
id SERIAL PRIMARY KEY,
title TEXT,
body TEXT
);
A basic full-text search can be performed by converting text into a tsvector and comparing it with a tsquery:
SELECT *
FROM article
WHERE to_tsvector('portuguese', title || ' ' || body)
@@ plainto_tsquery('portuguese', 'postgres database');
This query finds rows whose text semantically matches the words “postgres” and “database”, regardless of word order, case, or inflection.
Why Not LIKE?
Traditional SQL searches often rely on:
WHERE body ILIKE '%postgres%'
This approach:
- Cannot use linguistic analysis
- Does not scale well on large datasets
- Fails to rank results by relevance
PostgreSQL Full-Text Search solves these limitations.
Using plainto_tsquery
plainto_tsquery is designed for user input. It automatically:
- Splits text into terms
- Applies the language dictionary
- Combines terms using logical AND
SELECT *
FROM article
WHERE to_tsvector('portuguese', body)
@@ plainto_tsquery('portuguese', 'powerful database');
Using to_tsquery (Advanced Syntax)
When the query syntax is controlled, to_tsquery allows explicit logical operators:
SELECT *
FROM article
WHERE to_tsvector('portuguese', body)
@@ to_tsquery('portuguese', 'postgres & (database | data)');
Prefix matching is also supported:
WHERE to_tsvector('portuguese', body)
@@ to_tsquery('portuguese', 'postgre:*');
Core Concepts of PostgreSQL Full-Text Search
PostgreSQL Full-Text Search relies on two specialized data types:
- tsvector
- A normalized and indexed representation of a document. Text is tokenized, lowercased, stripped of stop words, and stemmed according to a language dictionary.
- tsquery
- A structured search query containing terms and logical operators such as AND (&), OR (|), NOT (!), and prefix matching (:*).
Example Transformation
The text:
PostgreSQL is a powerful database
may be transformed into a tsvector similar to:
'databas':4 'power':3 'postgresql':1
A user query such as:
powerful database
becomes a tsquery equivalent to:
'power' & 'databas'
The @@ Operator
The @@ operator is the central operator of PostgreSQL Full-Text Search:
tsvector @@ tsquery
It evaluates whether a document matches a full-text query and returns a boolean result.
Conceptually, it can be read as:
"Does this indexed document match this search query?"
The @@ operator:
- Executes the full-text match
- Enables the use of GIN or GiST indexes
- Is required for ranking and highlighting
- Has no equivalent in standard SQL or JPQL
Without @@, PostgreSQL Full-Text Search cannot function.
Indexing with GIN
For performance, PostgreSQL Full-Text Search relies on inverted indexes, most commonly GIN (Generalized Inverted Index).
A recommended approach is to store a dedicated tsvector column and index it:
ALTER TABLE article
ADD COLUMN fts tsvector
GENERATED ALWAYS AS (
setweight(to_tsvector('portuguese', coalesce(title,'')), 'A') ||
setweight(to_tsvector('portuguese', coalesce(body,'')), 'B')
) STORED;
CREATE INDEX idx_article_fts
ON article USING GIN (fts);
This approach avoids triggers and guarantees consistency.
Ranking and Highlighting
PostgreSQL provides ranking functions such as ts_rank and ts_rank_cd to score documents by relevance.
Highlighting is handled by ts_headline, which extracts and emphasizes relevant fragments from the original text.
Using PostgreSQL Full-Text Search from Java (Spring Data JPA)
Spring Data JPA does not support PostgreSQL Full-Text Search operators directly in JPQL. The recommended approach is to use native SQL queries while keeping entity management and CRUD operations within JPA.
Entity Mapping
@Entity
@Table(name = "article")
public class Article {
@Id
@GeneratedValue(strategy = GenerationType.IDENTITY)
private Long id;
private String title;
@Column(columnDefinition = "text")
private String body;
@Column(name = "fts", insertable = false, updatable = false)
private String fts;
}
Repository with Full-Text Search
public interface ArticleRepository extends JpaRepository<Article, Long> {
@Query(value = """
SELECT a.*
FROM article a
WHERE a.fts @@ plainto_tsquery('portuguese', :query)
ORDER BY ts_rank(a.fts, plainto_tsquery('portuguese', :query)) DESC
""", nativeQuery = true)
List<Article> search(@Param("query") String query);
}
Using PostgreSQL Full-Text Search from Python (Peewee)
Peewee is a lightweight ORM that allows direct access to PostgreSQL operators and functions, making it well-suited for Full-Text Search when used with explicit SQL expressions.
Model Definition
from peewee import *
from playhouse.postgres_ext import PostgresqlExtDatabase
db = PostgresqlExtDatabase("example_db")
class BaseModel(Model):
class Meta:
database = db
class Article(BaseModel):
title = TextField()
body = TextField()
fts = TextField(null=True)
Full-Text Search Query
from peewee import fn, SQL
query_text = "postgres database"
query = (
Article
.select(
Article,
fn.ts_rank(
Article.fts,
fn.plainto_tsquery('portuguese', query_text)
).alias('rank')
)
.where(
Article.fts.op('@@')(
fn.plainto_tsquery('portuguese', query_text)
)
)
.order_by(SQL('rank').desc())
)
for article in query:
print(article.title, article.rank)
ORM vs Search Engines
PostgreSQL Full-Text Search is sufficient for most applications, including:
- Blog and article search
- Documentation systems
- Administrative and internal tools
- Moderate autocomplete requirements
Dedicated search engines such as Elasticsearch or OpenSearch are recommended only when requirements include advanced linguistic analysis, distributed indexing, or large-scale search analytics.
Best Practices
- Always use a dedicated
tsvectorcolumn - Always create a GIN index on the
tsvector - Prefer
plainto_tsqueryorwebsearch_to_tsqueryfor user input - Use
to_tsqueryonly when the query syntax is controlled - Keep Full-Text Search logic inside PostgreSQL
- Use ORM abstractions for CRUD and native SQL for search
Conclusion
PostgreSQL Full-Text Search is a mature, efficient, and production-ready solution that integrates naturally with both Java and Python ecosystems.
By starting with SQL fundamentals and layering ORM integration on top, applications can achieve high-performance, semantically rich search without the operational complexity of external search engines.
Web-Style Search with websearch_to_tsquery
PostgreSQL provides websearch_to_tsquery, a function designed to accept search strings written in a format similar to modern web search engines (such as Google).
Unlike to_tsquery, which requires explicit operators, and plainto_tsquery, which always combines terms with AND, websearch_to_tsquery supports a more expressive and user-friendly syntax.
Supported Syntax
websearch_to_tsquery understands:
- Space-separated terms (implicit AND)
- OR using the keyword
OR - Quoted phrases
- Unary minus for negation
Examples
Basic search (implicit AND):
SELECT *
FROM article
WHERE fts @@ websearch_to_tsquery('portuguese', 'postgres database');
OR operator:
WHERE fts @@ websearch_to_tsquery('portuguese', 'postgres OR mysql');
Quoted phrase:
WHERE fts @@ websearch_to_tsquery('portuguese', '"full text search"');
Negation:
WHERE fts @@ websearch_to_tsquery('portuguese', 'postgres -mysql');
Why websearch_to_tsquery Matters
This function is particularly well-suited for direct user input because:
- It avoids syntax errors common with
to_tsquery - It feels natural to end users
- It reduces the need for custom query parsing in the application layer
For most user-facing search boxes, websearch_to_tsquery is the recommended choice.
FTS vs pg_trgm
PostgreSQL offers two powerful but fundamentally different approaches to text search: Full-Text Search (FTS) and trigram similarity via the pg_trgm extension.
Understanding their differences is essential for choosing the correct tool.
pg_trgm Overview
The pg_trgm extension indexes text by breaking it into overlapping sequences of three characters (trigrams). It enables fast similarity searches and partial matching.
To enable it:
CREATE EXTENSION IF NOT EXISTS pg_trgm;
A typical trigram index:
CREATE INDEX idx_article_trgm
ON article
USING GIN (body gin_trgm_ops);
Example query:
SELECT *
FROM article
WHERE body ILIKE '%postgre%';
Key Differences
| Feature | PostgreSQL FTS (@@) | pg_trgm |
|---|---|---|
| Linguistic analysis | Yes (stemming, stop words) | No |
| Word-based search | Yes | No |
| Partial word matching | Limited (:*) | Yes |
| Typo tolerance | Low | High |
| Ranking by relevance | Yes (ts_rank) | Limited (similarity) |
| Language awareness | Yes | No |
| Index size | Moderate | Larger |
| Best use case | Semantic text search | Fuzzy / partial matching |
= When to Use Full-Text Search
Use PostgreSQL Full-Text Search when:
- Searching long documents or articles
- Language semantics matter
- You need relevance ranking
- Queries are word-based
- You want deterministic, explainable results
= When to Use pg_trgm
Use pg_trgm when:
- Implementing autocomplete
- Supporting typo-tolerant search
- Matching substrings or partial words
- Searching identifiers, codes, or names
Combining FTS and pg_trgm
In many real-world systems, the best solution is to combine both approaches.
Example:
- Use FTS (
@@) for main content search - Use
pg_trgmfor autocomplete or fallback fuzzy matching - Merge or rank results at the application layer
This hybrid approach often delivers search quality comparable to dedicated search engines while keeping all logic inside PostgreSQL.