PDF to GPT 3.5 Fine Tuning in Python

An example Python script that takes a folder of PDF files, extracts the text from each PDF, and uses OpenAI's fine-tuning API to fine-tune a GPT-3.5 model, keeping each training example within the 4,096-token limit:

import os
import json
import openai
import PyPDF2

# Set OpenAI API key
openai.api_key = 'YOUR_API_KEY'

# Function to extract text from a PDF file
def extract_text_from_pdf(file_path):
    with open(file_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        text = ""
        for page in reader.pages:
            # extract_text() can return None for pages with no extractable text
            text += page.extract_text() or ""
    return text.strip()

# Function to split text into chunks respecting the token limit
# (uses character count as a rough proxy for tokens)
def split_text_into_chunks(text, token_limit):
    chunks = []
    while len(text) > token_limit:
        chunk = text[:token_limit]
        # Try to end the chunk on a sentence boundary
        last_period_index = chunk.rfind('.')
        if last_period_index != -1:
            chunk = chunk[:last_period_index + 1]
        chunks.append(chunk)
        text = text[len(chunk):]
    chunks.append(text)
    return chunks

# Function to fine-tune the GPT-3.5 model via OpenAI's fine-tuning API
def fine_tune_model(text_chunks, training_file_path='training_data.jsonl'):
    # Write each chunk as one chat-formatted training example (JSONL).
    # Packing raw document text into the assistant message is a simplistic,
    # illustrative way to turn the PDFs into the required format.
    with open(training_file_path, 'w', encoding='utf-8') as f:
        for chunk in text_chunks:
            example = {
                "messages": [
                    {"role": "system", "content": "You are a helpful assistant."},
                    {"role": "user", "content": "Summarize this document excerpt."},
                    {"role": "assistant", "content": chunk}
                ]
            }
            f.write(json.dumps(example) + '\n')

    # Upload the training file and start a fine-tuning job
    upload = openai.File.create(file=open(training_file_path, 'rb'), purpose='fine-tune')
    job = openai.FineTuningJob.create(training_file=upload['id'], model='gpt-3.5-turbo')
    return job['id']

# Folder path containing the PDF files
folder_path = 'path/to/folder'

# List to store the extracted text from PDF files
pdf_texts = []

# Iterate through the PDF files in the folder
for filename in os.listdir(folder_path):
    if filename.endswith('.pdf'):
        file_path = os.path.join(folder_path, filename)
        pdf_text = extract_text_from_pdf(file_path)
        pdf_texts.append(pdf_text)

# Concatenate the extracted text from all PDF files
all_text = '\n'.join(pdf_texts)

# Split the text into chunks (4096 characters as a conservative stand-in for the 4096-token limit)
token_limit = 4096
text_chunks = split_text_into_chunks(all_text, token_limit)

# Start a fine-tuning job with the text chunks
job_id = fine_tune_model(text_chunks)

# Save the fine-tuning job ID to a file
with open('fine_tuned_model.txt', 'w') as file:
    file.write(job_id)

In this script, you need to replace 'YOUR_API_KEY' with your actual OpenAI API key. The script uses the PyPDF2 library to extract text from PDF files.
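
If you'd rather not hard-code the key, one common pattern is to read it from an environment variable. A minimal sketch (OPENAI_API_KEY is just a conventional variable name, not something the script above requires):

import os
import openai

# Read the API key from an environment variable instead of hard-coding it
openai.api_key = os.environ.get("OPENAI_API_KEY")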

The extract_text_from_pdf function takes a file path and returns the extracted text from the PDF file.
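
Some PDFs are encrypted or contain pages with no extractable text. A variant that handles those cases might look like this (a sketch; extract_text_safely is a hypothetical helper, not part of the script above):

import PyPDF2

def extract_text_safely(file_path):
    with open(file_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        # Try an empty password for encrypted PDFs; give up if that fails
        if reader.is_encrypted:
            try:
                reader.decrypt("")
            except Exception:
                return ""
        # extract_text() can return None for pages with no extractable text
        return "\n".join((page.extract_text() or "") for page in reader.pages).strip()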

The split_text_into_chunks function splits the text into chunks that respect the specified limit, using character count as a rough proxy for tokens and trying to end each chunk at a sentence boundary.
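
If you want to enforce the limit on actual tokens rather than characters, a sketch using the tiktoken library (assuming the cl100k_base encoding used by gpt-3.5-turbo) could look like this:

import tiktoken

def split_text_by_tokens(text, token_limit):
    encoding = tiktoken.get_encoding("cl100k_base")
    tokens = encoding.encode(text)
    chunks = []
    # Slice the token list into windows of at most token_limit tokens
    for start in range(0, len(tokens), token_limit):
        chunks.append(encoding.decode(tokens[start:start + token_limit]))
    return chunks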

The fine_tune_model function writes the chunks to a JSONL file of chat-formatted training examples, uploads the file to OpenAI, starts a fine-tuning job on gpt-3.5-turbo, and returns the job ID.
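
Fine-tuning jobs run asynchronously, so you typically poll the job until it reaches a terminal state and then read the fine-tuned model name from it. A minimal sketch (job_id comes from the script above; wait_for_fine_tune is a hypothetical helper):

import time
import openai

def wait_for_fine_tune(job_id, poll_seconds=60):
    # Poll the fine-tuning job until it succeeds, fails, or is cancelled
    while True:
        job = openai.FineTuningJob.retrieve(job_id)
        if job['status'] in ('succeeded', 'failed', 'cancelled'):
            return job
        time.sleep(poll_seconds)

job = wait_for_fine_tune(job_id)
print(job['status'], job.get('fine_tuned_model'))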

The script iterates through the PDF files in the specified folder, extracts the text from each PDF, concatenates it, and splits it into chunks that respect the 4,096-token limit per training example. It then uploads the chunks as training data, starts a fine-tuning job for GPT-3.5, and saves the job ID to a file named 'fine_tuned_model.txt'.
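
Once the job has succeeded, the resulting model name (something like ft:gpt-3.5-turbo-0613:...) can be passed to the chat completions API; the model name below is a placeholder:

import openai

response = openai.ChatCompletion.create(
    model="ft:gpt-3.5-turbo-0613:your-org::abc123",  # placeholder fine-tuned model name
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the main themes of the uploaded documents."}
    ]
)
print(response['choices'][0]['message']['content'])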

Remember to install the required libraries (openai and PyPDF2, plus tiktoken if you use the token-based splitter) before running the script, e.g. pip install openai PyPDF2 tiktoken.