Links to Text Summarization with BART Model

yjyuwisely 2024. 8. 24. 07:00

https://medium.com/@sandyeep70/demystifying-text-summarization-with-deep-learning-ce08d99eda97

 


import textwrap

from pypdf import PdfReader
from transformers import BartForConditionalGeneration, BartTokenizer

def extract_text_from_pdf(pdf_path):
    # Minimal stand-in for the helper (not shown in this excerpt): pypdf page text.
    reader = PdfReader(pdf_path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

def text_summarizer_from_pdf(pdf_path):
    pdf_text = extract_text_from_pdf(pdf_path)

    # Load the CNN/DailyMail-finetuned BART checkpoint and its tokenizer.
    model_name = "facebook/bart-large-cnn"
    model = BartForConditionalGeneration.from_pretrained(model_name)
    tokenizer = BartTokenizer.from_pretrained(model_name)

    # BART needs no "summarize: " task prefix (that is a T5 convention); encode
    # the raw text, truncated to the model's 1024-token input limit.
    inputs = tokenizer.encode(pdf_text, return_tensors="pt", max_length=1024, truncation=True)
    summary_ids = model.generate(inputs, max_length=150, min_length=50, length_penalty=2.0, num_beams=4, early_stopping=True)

    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    formatted_summary = "\n".join(textwrap.wrap(summary, width=80))
    return formatted_summary
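
A minimal usage sketch; "sample.pdf" is a placeholder path, not a file from the original post:

summary = text_summarizer_from_pdf("sample.pdf")  # placeholder path: any local PDF
print(summary)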

Let's break down the model.generate() function parameters in more detail:

1. max_length=150:

  • This sets the maximum number of tokens the generated summary can have. With BART's BPE tokenizer, tokens are subword pieces, so 150 tokens is usually somewhat fewer than 150 words. Generation stops once this length is reached, keeping the summary within a fixed budget.

2. min_length=50:

  • This sets the minimum number of tokens for the generated summary. Under the hood, the end-of-sequence token is suppressed until at least this many tokens have been produced, ensuring the summary isn't too short.

3. length_penalty=2.0:

  • The length penalty is an exponent applied to the sequence length when beam search scores candidates: each beam's cumulative log-probability is divided by its length raised to length_penalty. Because log-probabilities are negative, values greater than 1.0 actually favor longer sequences, while values below 1.0 favor shorter ones. Here, 2.0 tilts the search toward longer candidates within the 50-150 token bounds (see the score sketch after this list).

4. num_beams=4:

  • Beam search is a technique to improve the quality of generated text. With num_beams=4, the model keeps the 4 highest-scoring candidate sequences at each generation step and finally returns the best complete one by cumulative score. This usually yields more coherent, higher-quality summaries than greedy decoding (where only the single most probable token is chosen at each step).

5. early_stopping=True:

  • This makes beam search stop as soon as num_beams complete candidates (sequences that have produced the end-of-sequence token, not merely a period) have been found, rather than continuing to search for potentially better ones. This avoids unnecessary computation once good summaries are complete.
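
To make the length-penalty scoring concrete, here is a small self-contained sketch of the score that Hugging Face's beam search assigns to a finished beam; the log-probability numbers are made up for illustration:

def beam_score(sum_logprobs, length, length_penalty):
    # Hugging Face beam search scores a finished beam as:
    #   sum of token log-probabilities / (length ** length_penalty)
    return sum_logprobs / (length ** length_penalty)

# Two hypothetical finished beams with the same average token
# log-probability of -0.5: one 60 tokens long, one 120 tokens long.
print(beam_score(-0.5 * 60, 60, 2.0))    # -0.00833...
print(beam_score(-0.5 * 120, 120, 2.0))  # -0.00417... (less negative, so the longer beam wins)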

Summary of Workflow:

  • Input Preparation: The text is tokenized and encoded into a format the model understands.
  • Summary Generation: The model generates a summary between 50 and 150 tokens long, with the length penalty shaping how beam search scores candidates of different lengths.
  • Beam Search: The model tracks 4 candidate sequences in parallel, which usually yields a more coherent result than greedy decoding.
  • Final Summary: The generated ids are decoded back into human-readable text, skipping the special tokens used during generation (the sketch below walks through these steps on a plain string).
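
As a sketch of these four steps on a plain string, without the PDF-extraction step (the input text below is only a placeholder):

from transformers import BartForConditionalGeneration, BartTokenizer

model_name = "facebook/bart-large-cnn"
model = BartForConditionalGeneration.from_pretrained(model_name)
tokenizer = BartTokenizer.from_pretrained(model_name)

text = "Paste or load a long article here..."  # placeholder input

# 1. Input preparation: tokenize, truncating to BART's 1024-token limit.
inputs = tokenizer.encode(text, return_tensors="pt", max_length=1024, truncation=True)

# 2.-3. Summary generation with 4-beam search, bounded to 50-150 tokens.
summary_ids = model.generate(
    inputs,
    max_length=150,
    min_length=50,
    length_penalty=2.0,
    num_beams=4,
    early_stopping=True,
)

# 4. Final summary: decode ids back to text, dropping special tokens.
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))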

This setup produces summaries that are generally concise, coherent, and readable, balancing brevity against coverage of the essential information.

 
