Links to Text Summarization with BART Model

yjyuwisely 2024. 8. 24. 07:00

https://medium.com/@sandyeep70/demystifying-text-summarization-with-deep-learning-ce08d99eda97

 


import textwrap

from pypdf import PdfReader
from transformers import BartForConditionalGeneration, BartTokenizer

def extract_text_from_pdf(pdf_path):
    # Minimal stand-in for the helper (not shown in this excerpt): pypdf page text.
    reader = PdfReader(pdf_path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

def text_summarizer_from_pdf(pdf_path):
    pdf_text = extract_text_from_pdf(pdf_path)

    # Load the CNN/DailyMail-finetuned BART checkpoint and its tokenizer.
    model_name = "facebook/bart-large-cnn"
    model = BartForConditionalGeneration.from_pretrained(model_name)
    tokenizer = BartTokenizer.from_pretrained(model_name)

    # BART needs no "summarize: " task prefix (that is a T5 convention); encode
    # the raw text, truncated to the model's 1024-token input limit.
    inputs = tokenizer.encode(pdf_text, return_tensors="pt", max_length=1024, truncation=True)
    summary_ids = model.generate(inputs, max_length=150, min_length=50, length_penalty=2.0, num_beams=4, early_stopping=True)

    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    formatted_summary = "\n".join(textwrap.wrap(summary, width=80))
    return formatted_summary
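
A minimal usage sketch; "sample.pdf" is a placeholder path, not a file from the original post:

summary = text_summarizer_from_pdf("sample.pdf")  # placeholder path: any local PDF
print(summary)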

Let's break down the model.generate() function parameters in more detail:

1. max_length=150:

  • This sets the maximum number of tokens the generated summary can have. With BART's BPE tokenizer, tokens are subword pieces, so 150 tokens is usually somewhat fewer than 150 words. Generation stops once this length is reached, keeping the summary within a fixed budget.

2. min_length=50:

  • This sets the minimum number of tokens for the generated summary. Under the hood, the end-of-sequence token is suppressed until at least this many tokens have been produced, ensuring the summary isn't too short.

3. length_penalty=2.0:

  • The length penalty is an exponent applied to the sequence length when beam search scores candidates: each beam's cumulative log-probability is divided by its length raised to length_penalty. Because log-probabilities are negative, values greater than 1.0 actually favor longer sequences, while values below 1.0 favor shorter ones. Here, 2.0 tilts the search toward longer candidates within the 50-150 token bounds (see the score sketch after this list).

4. num_beams=4:

  • Beam search is a technique to improve the quality of generated text. With num_beams=4, the model keeps the 4 highest-scoring candidate sequences at each generation step and finally returns the best complete one by cumulative score. This usually yields more coherent, higher-quality summaries than greedy decoding (where only the single most probable token is chosen at each step).

5. early_stopping=True:

  • This makes beam search stop as soon as num_beams complete candidates (sequences that have produced the end-of-sequence token, not merely a period) have been found, rather than continuing to search for potentially better ones. This avoids unnecessary computation once good summaries are complete.
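
To make the length-penalty scoring concrete, here is a small self-contained sketch of the score that Hugging Face's beam search assigns to a finished beam; the log-probability numbers are made up for illustration:

def beam_score(sum_logprobs, length, length_penalty):
    # Hugging Face beam search scores a finished beam as:
    #   sum of token log-probabilities / (length ** length_penalty)
    return sum_logprobs / (length ** length_penalty)

# Two hypothetical finished beams with the same average token
# log-probability of -0.5: one 60 tokens long, one 120 tokens long.
print(beam_score(-0.5 * 60, 60, 2.0))    # -0.00833...
print(beam_score(-0.5 * 120, 120, 2.0))  # -0.00417... (less negative, so the longer beam wins)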

Summary of Workflow:

  • Input Preparation: The text is tokenized and encoded into a format the model understands.
  • Summary Generation: The model generates a summary between 50 and 150 tokens long, with the length penalty shaping how beam search scores candidates of different lengths.
  • Beam Search: The model tracks 4 candidate sequences in parallel, which usually yields a more coherent result than greedy decoding.
  • Final Summary: The generated ids are decoded back into human-readable text, skipping the special tokens used during generation (the sketch below walks through these steps on a plain string).
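
As a sketch of these four steps on a plain string, without the PDF-extraction step (the input text below is only a placeholder):

from transformers import BartForConditionalGeneration, BartTokenizer

model_name = "facebook/bart-large-cnn"
model = BartForConditionalGeneration.from_pretrained(model_name)
tokenizer = BartTokenizer.from_pretrained(model_name)

text = "Paste or load a long article here..."  # placeholder input

# 1. Input preparation: tokenize, truncating to BART's 1024-token limit.
inputs = tokenizer.encode(text, return_tensors="pt", max_length=1024, truncation=True)

# 2.-3. Summary generation with 4-beam search, bounded to 50-150 tokens.
summary_ids = model.generate(
    inputs,
    max_length=150,
    min_length=50,
    length_penalty=2.0,
    num_beams=4,
    early_stopping=True,
)

# 4. Final summary: decode ids back to text, dropping special tokens.
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))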

This setup produces summaries that are generally concise, coherent, and readable, balancing brevity against coverage of the essential information.

 
