Training AI to Paraphrase Text in Python Using Transformers

Training an Artificial Intelligence model to paraphrase text can be complicated, but it becomes manageable when broken into a series of steps. It still takes considerable effort, because training AI to paraphrase text in Python using transformers requires a large dataset of text and a fair amount of code.

In this article, we’ll explain how AI is trained and how to achieve this task. We discuss what AI paraphrasing is, how it works, and how models are trained to produce alternate versions of a text. So, let’s delve in.

What is AI Paraphrasing?

The clue is right there in the word. AI paraphrasing means using Artificial Intelligence to change a text’s wording and structure while keeping its meaning the same.

Artificial Intelligence has been helping people in almost every area of life, and writing is no exception. Tools have been developed that can automatically paraphrase your content. These paraphrasing tools apply their algorithms to the user’s input and instructions and change the text accordingly.

Using such a tool is a pretty simple way to paraphrase a text. Actually training the AI behind it is much more complicated, especially for people who do not know much about algorithms, coding, or AI.

That’s why we have provided a general overview of how AI is trained to make such changes to your text, so the process can be understood by a general audience.

Before we begin discussing the process, it is essential that we learn about the terms being used in it.

  • Python: A programming language, widely used because of its simplicity, for purposes such as web development, data analysis, and developing AI models.
  • Transformers: A type of neural network architecture widely used for Natural Language Processing (NLP) tasks, of which paraphrasing is one. In this article, the name also refers to the Hugging Face Transformers library, which provides pre-trained models of this kind.

Training AI to Paraphrase Text in Python Using Transformers

The following are the steps taken to train AI to paraphrase text in Python.

1. Getting the Basic Programs: Before you start, some programs must be installed on your computer. First of all, you have to install Python (3.6 or higher; recent releases of the Transformers library require a newer Python version, so check the library’s installation notes.) Once you have done it, you will need some libraries: Transformers (the Hugging Face Transformers library), PyTorch or TensorFlow (the code in this article uses PyTorch), and the Hugging Face Datasets library, which the training code in step 4 imports. Installing these libraries is pretty simple. You can get them by using ‘pip’:

pip install transformers torch datasets
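
Before moving on, it can help to confirm that the libraries can actually be imported. The snippet below is a minimal sketch that uses only the Python standard library. The helper name missing_packages is made up for this illustration, and the list of names simply mirrors the libraries imported by the code later in this article.

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that Python cannot find for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names of the libraries used by the code in this article.
required = ["transformers", "torch", "datasets"]
missing = missing_packages(required)

if missing:
    print("Still missing:", ", ".join(missing))
else:
    print("All required libraries are installed.")
```

If anything is reported as missing, run the pip command again for that library.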

2. Providing a Paraphrase Dataset: The model needs a large amount of data to learn from. In this case, you have to provide two versions of many sentences: one original, and the other paraphrased.

This helps the machine learn the similarities and differences between these two versions. From these pairs it learns which words and structures can be swapped, so it can later produce its own paraphrase of a new sentence.

This is not an easy task and you might need someone’s help to perform it. Once you are done, simply move on to the next step.
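
As an illustration of what such paired data can look like, the sketch below stores sentence pairs in a JSON Lines file (one JSON object per line) and reads them back. The sentences, the field names “original” and “paraphrase”, the file name, and the two helper functions are all made up for this example; a real dataset would contain thousands of pairs.

```python
import json
import os
import tempfile

# Made-up example pairs; field names are an assumption for this sketch.
pairs = [
    {"original": "The weather is nice today.",
     "paraphrase": "It is a pleasant day outside."},
    {"original": "She finished the report on time.",
     "paraphrase": "She completed the report by the deadline."},
]


def write_jsonl(path, rows):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read the file back into a list of dictionaries, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


path = os.path.join(tempfile.gettempdir(), "paraphrase_pairs.jsonl")
write_jsonl(path, pairs)
loaded = read_jsonl(path)
print(len(loaded), "pairs loaded")
```

A file in this shape can usually be loaded with the Datasets library through load_dataset("json", data_files=path), which gives each field its own column.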

3. Data Processing: Once you have collected enough data to train an AI model to paraphrase a sentence, you have to tokenize it, that is, convert the text into the numeric IDs the model works with. The tokenizers in the Transformers library can perform this task. The following code will help you do so.

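from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your_pretrained_model_name")

# Keep originals and paraphrases in separate lists, aligned by position.
originals = ["Original sentence 1", "Original sentence 2"]
paraphrases = ["Paraphrased sentence 1", "Paraphrased sentence 2"]

# Originals become the model inputs; paraphrases become the target labels.
tokenized_sentences = tokenizer(originals, text_target=paraphrases,
                                padding=True, truncation=True, return_tensors="pt")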

You have to replace the “your_pretrained_model_name” with the name of the model you want to fine-tune. Now, move on to the next step.
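
To see what the padding and truncation options in the code above do, here is a deliberately simplified toy version in plain Python. It is not how the Hugging Face tokenizers work internally (they split text into subword units rather than on spaces), and the function names build_vocab and encode_batch are invented for this sketch. It only shows the idea: every sentence becomes a list of IDs of the same length, with a mask marking which positions hold real words.

```python
def build_vocab(sentences):
    """Assign an ID to every distinct lowercase word; ID 0 is reserved for padding."""
    vocab = {"<pad>": 0}
    for sentence in sentences:
        for word in sentence.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode_batch(sentences, vocab, max_length):
    """Convert sentences to equal-length ID lists plus an attention mask."""
    input_ids, attention_mask = [], []
    for sentence in sentences:
        # Truncation: keep at most max_length tokens.
        ids = [vocab[word] for word in sentence.lower().split()][:max_length]
        pad = max_length - len(ids)
        # Padding: fill the remaining positions with the padding ID.
        input_ids.append(ids + [vocab["<pad>"]] * pad)
        attention_mask.append([1] * len(ids) + [0] * pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


sentences = ["The cat sat", "The cat sat on the warm mat"]
vocab = build_vocab(sentences)
batch = encode_batch(sentences, vocab, max_length=5)
print(batch["input_ids"])       # [[1, 2, 3, 0, 0], [1, 2, 3, 4, 1]]
print(batch["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

The short sentence is padded with zeros, and the long one is cut off after five words. A real tokenizer returns the same kind of structure, only with a much larger vocabulary.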

4. Fine-Tune The Transformer Model: Once you have processed the dataset, the next step is choosing a pre-trained Transformer model as your base model. Because the code below loads the model with AutoModelForSeq2SeqLM, choose a sequence-to-sequence (encoder-decoder) model such as T5 or BART. Models such as GPT-2 (decoder-only) or BERT (encoder-only) cannot be loaded this way.

Now, you have to fine-tune this model on your paraphrase dataset using a sequence-to-sequence objective: the original sentence is the input and its paraphrase is the target. A simple way of doing so is using the Hugging Face Transformers library. You can take an idea from the following code (written in Google Colab):

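from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer, DataCollatorForSeq2Seq,
                          Seq2SeqTrainingArguments, Seq2SeqTrainer)
from datasets import load_dataset

# Load your dataset using the datasets library.
dataset = load_dataset("your_dataset")

# Load tokenizer and model.
tokenizer = AutoTokenizer.from_pretrained("your_pretrained_model_name")
model = AutoModelForSeq2SeqLM.from_pretrained("your_pretrained_model_name")

# Tokenize the sentence pairs. The column names "original" and "paraphrase"
# are assumptions; change them to match the columns of your dataset.
def preprocess(batch):
    return tokenizer(batch["original"], text_target=batch["paraphrase"],
                     truncation=True)

tokenized_dataset = dataset.map(preprocess, batched=True,
                                remove_columns=dataset["train"].column_names)

# Define training arguments.
training_args = Seq2SeqTrainingArguments(
    output_dir="./paraphrase_model",
    per_device_train_batch_size=4,
    save_steps=500,
    save_total_limit=2,
    num_train_epochs=5,
    logging_dir="./logs",
    overwrite_output_dir=True,
)

# Create trainer. The collator pads each batch and prepares the labels.
trainer = Seq2SeqTrainer(
    model=model,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    args=training_args,
    train_dataset=tokenized_dataset["train"],
)

# Start training, then save the model and the tokenizer for step 5.
trainer.train()
trainer.save_model()
tokenizer.save_pretrained("./paraphrase_model")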

You have to replace the “your_pretrained_model_name” and “your_dataset” with your own specific model name and dataset to make this code work.

5. Paraphrase the Text:

After you have performed the steps explained above, it’s time to run the trained model to get rephrased text. You can take an idea from the following code to do so:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load tokenizer and trained model.
tokenizer = AutoTokenizer.from_pretrained("./paraphrase_model")
model = AutoModelForSeq2SeqLM.from_pretrained("./paraphrase_model")

# Input text to paraphrase.
input_text = "Original sentence to paraphrase."

# Tokenize and generate paraphrases.
input_ids = tokenizer.encode(input_text, return_tensors="pt", max_length=512,
                             truncation=True)
paraphrase_ids = model.generate(input_ids, max_length=100, num_return_sequences=5,
                                num_beams=5, no_repeat_ngram_size=2)

# Decode and print paraphrases.
paraphrases = [tokenizer.decode(ids, skip_special_tokens=True) for ids in paraphrase_ids]
for paraphrase in paraphrases:
    print(paraphrase)

Make sure to replace “./paraphrase_model” with the path to your trained model. 
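
Because the code above asks for five candidates from beam search, some of them may turn out to be copies of the input or of each other. The sketch below is one possible way to tidy the list before showing it to a user. The function name clean_candidates and the example sentences are invented for this illustration, and the comparison (case-insensitive, ignoring extra spaces) is an assumption about what should count as a duplicate.

```python
def clean_candidates(source, candidates):
    """Drop candidates that repeat the source or an earlier candidate."""
    def normalize(text):
        # Lowercase and collapse runs of whitespace before comparing.
        return " ".join(text.lower().split())

    seen = {normalize(source)}
    kept = []
    for candidate in candidates:
        key = normalize(candidate)
        if key and key not in seen:
            seen.add(key)
            kept.append(candidate)
    return kept


source = "The meeting was postponed until Friday."
candidates = [
    "The meeting was postponed until Friday.",      # same as the input
    "The meeting has been delayed until Friday.",
    "the meeting has been delayed until  Friday.",  # duplicate of the previous one
    "They moved the meeting to Friday.",
]
for text in clean_candidates(source, candidates):
    print(text)
```

In the generation code, the list named paraphrases would take the place of the example candidates.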

Performing these steps will give you several paraphrased versions of the sentence. Make sure to check the quality of these texts using evaluation metrics such as ROUGE or METEOR, which compare the generated text against reference text.
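
To give a feel for what a metric like ROUGE measures, the sketch below computes a simplified word-overlap score between a generated sentence and a reference paraphrase, in the spirit of ROUGE-1. It is not the official ROUGE implementation (which handles tokenization, stemming, and longer word sequences more carefully), and the function name unigram_f1 and the example sentences are made up for this illustration.

```python
from collections import Counter


def unigram_f1(candidate, reference):
    """F1 score of word overlap between a candidate and a reference sentence."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Count each shared word at most as often as it appears in both texts.
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


reference = "the cat sat on the mat"
generated = "the cat lay on the mat"
print(round(unigram_f1(generated, reference), 3))  # 0.833
```

A score near 1 means the generated text shares most of its words with the reference. For paraphrasing, this kind of score is best read alongside a human check, since a good paraphrase may legitimately use different words.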

Conclusion

AI paraphrasing has become a lot more common these days. There are various purposes for doing it and, similarly, various methods to perform it, including paraphrasing manually or with the help of an AI tool.

In this article, we have described how the AI behind such tools is trained to produce paraphrased text. Understanding the steps given above will give you a clear picture of how Python and its libraries are used to train AI to paraphrase a text.
