Text summarization has two primary categories: extractive and abstractive.
Extractive summarization selects sections of the text and joins them together to form a summary. This is commonly backed by graph algorithms like TextRank, which find the sections or sentences that have the most in common with the rest of the text. These summaries can be highly effective, but they can't transform the text and they lack contextual understanding.
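To make the extractive idea concrete, here is a minimal sketch. It is not TextRank: it swaps the graph ranking for a much simpler word-frequency score, and the names `STOPWORDS`, `tokenize` and `extractive_summary` are made up for illustration. What it does show is the defining property of extractive summaries: every sentence in the output is copied verbatim from the input.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real systems use larger lists)
STOPWORDS = {"a", "an", "and", "are", "as", "at", "be", "by", "for", "in",
             "is", "it", "of", "on", "or", "that", "the", "this", "to", "with"}


def tokenize(sentence):
    """Lowercase a sentence and return its non-stopword tokens."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOPWORDS]


def extractive_summary(text, count=1):
    """Return the `count` highest-scoring sentences, kept in document order."""
    # Naive sentence split: break after ., ! or ? followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    # Word frequencies across the whole text
    frequencies = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        # Average frequency of the sentence's words, so long sentences don't win by length alone
        tokens = tokenize(sentence)
        return sum(frequencies[w] for w in tokens) / len(tokens) if tokens else 0.0

    # Rank sentence indices by score, keep the top ones, then restore document order
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    selected = sorted(ranked[:count])
    return " ".join(sentences[i] for i in selected)


sample = ("Search engines index documents. "
          "Users run search queries against search engines. "
          "The weather is nice today.")
print(extractive_summary(sample, count=1))
```

Because the output is stitched together from existing sentences, it can never rephrase or merge ideas, which is the gap abstractive summarization fills.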
Abstractive summarization uses Natural Language Processing (NLP) models to build transformative summaries of text. This is similar to having a human read an article and asking what it was about. A human wouldn't just give a verbatim reading of the text. This article shows how blocks of text can be summarized using an abstractive summarization pipeline.
Install dependencies
Install txtai and all dependencies. Since this article is using optional pipelines, we need to install the pipeline extras package.
pip install txtai[pipeline]
Create a Summary instance
The Summary instance is the main entry point for text summarization. This is a lightweight wrapper around the summarization pipeline in Hugging Face Transformers.
In addition to the default model, other models can be found on the Hugging Face model hub.
from txtai.pipeline import Summary
# Create summary model
summary = Summary()
Summarize text
The example below shows how a large block of text can be distilled down into a smaller summary.
text = ("Search is the base of many applications. Once data starts to pile up, users want to be able to find it. It’s the foundation "
"of the internet and an ever-growing challenge that is never solved or done. The field of Natural Language Processing (NLP) is "
"rapidly evolving with a number of new developments. Large-scale general language models are an exciting new capability "
"allowing us to add amazing functionality quickly with limited compute and people. Innovation continues with new models "
"and advancements coming in at what seems a weekly basis. This article introduces txtai, an AI-powered search engine "
"that enables Natural Language Understanding (NLU) based search in any application."
)
summary(text, maxlength=10)
Search is the foundation of the internet
Notice how the summarizer built a sentence using parts of the document above. It takes a basic understanding of language to interpret the first two sentences and combine them into a single transformative sentence.
Summarize a document
The next section retrieves an article, extracts text from it (more to come on this topic) and summarizes that text.
!wget "https://medium.com/neuml/time-lapse-video-for-the-web-a7d8874ff397"
from txtai.pipeline import Textractor
textractor = Textractor()
text = textractor("time-lapse-video-for-the-web-a7d8874ff397")
summary(text)
Time-lapse video is a popular way to show an area or event over a long period of time. The same concept can be applied to a dynamic real-time website with frequently updated data. webelapse is an open source project developed to provide this functionality. It can be used as is or modified for different use cases.
Click through the link to see the full article. This summary does a pretty good job of covering what the article is about!