Does anyone know of an LLM that I can easily run locally on my Linux PC to do simple text cleanup tasks, like removing unnecessary line breaks, removing the same word twice in a row, capitalizing names, etc.?
Well, it doesn't have to be an LLM, of course. Could be any tool. Maybe an extension to #LibreOffice?
Pepito Cleaner helped with unnecessary line breaks, which was a huge part of my problem. Still, removing duplicate words would be a big help, so if you know of any tool, let me know. Thanks! :) https://pepitoweb.altervista.org/pepito_cleaner/index.php
I think an LLM might be a bit of overkill for this.
What are you cleaning up? Sounds like poor OCR results, like you might get in an Internet Archive #ebook.
I use OCRFeeder as a front end for Tesseract, and do my own OCR rather than try to correct something done on autopilot.
Turns out to be quicker and much more accurate. I do it page-by-page and proof as I go.
I found OCRFeeder as a Flatpak, but it was old and terrible. Better to compile the latest version yourself.
@demerara Thanks! But no, not OCR. It's machine-transcribed audio from interviews. For some reason Whisper puts a ton of line breaks in it and doesn't capitalize correctly. Not even "I" and "Google". And since it's interviews, there's quite a lot of repeating of words like "the" and "and" and other stuff like that.
@forteller
Well, the line breaks can mostly be fixed with a regex search and replace: look for lower case - \n - lower case and similar. Not so sure how to handle duplicate words or capitalization.
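For what it's worth, here is a rough Python sketch of that regex idea, stretched to cover immediately repeated words and a small fixed word list for capitalization. It is only an illustration under assumptions: the function names and the PROPER_NOUNS list are made up for this example, and it will also collapse legitimate doubles like "had had", so proofread the result.

```python
import re

# Hypothetical word list for this example; extend it with the names you need.
PROPER_NOUNS = {"google": "Google", "whisper": "Whisper"}


def join_broken_lines(text):
    """Replace a single newline between lowercase text and lowercase text with a space.

    Blank lines (paragraph breaks) and breaks after sentence punctuation are left alone.
    """
    return re.sub(r"(?<=[a-z,])[ \t]*\n[ \t]*(?=[a-z])", " ", text)


def drop_repeated_words(text):
    """Collapse a word repeated right after itself ("the the") to one copy."""
    return re.sub(r"\b(\w+)(?:\s+\1\b)+", r"\1", text, flags=re.IGNORECASE)


def fix_capitalization(text):
    """Capitalize the standalone pronoun "i" and every word in PROPER_NOUNS."""
    text = re.sub(r"\bi\b", "I", text)
    for lower, proper in PROPER_NOUNS.items():
        text = re.sub(rf"\b{re.escape(lower)}\b", proper, text, flags=re.IGNORECASE)
    return text


def clean_transcript(text):
    """Run the three passes in order: join lines, drop repeats, fix capitals."""
    return fix_capitalization(drop_repeated_words(join_broken_lines(text)))


if __name__ == "__main__":
    raw = (
        "so i think the the\n"
        "problem with google is\n"
        "that that it adds breaks.\n"
        "\n"
        "Next paragraph."
    )
    print(clean_transcript(raw))
```

Joining lines first matters, because a repeated word is often split across a line break. Capitalizing names in general (ones you can't list ahead of time) is beyond what a plain regex can do.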
@demerara Yeah, I found a tool for that. https://tutoteket.no/@forteller/110837365317284006
@forteller @simon toots quite a bit about LLMs and running them on his own machine, I think, so his timeline, TIL, and blog might provide some clues.