Børge

Does anyone know of an LLM that I can easily run locally on my Linux PC to do simple text cleanup tasks, like removing unnecessary line breaks and the same word twice in a row, capitalize names, etc?

Well, it doesn't have to be an LLM, of course. Could be any tool. Maybe an extension to ?

Pepito Cleaner helped with unnecessary line breaks, which was a huge part of my problem. Still, removing duplicate words would be a big help, so if you know of any tool, let me know. Thanks! :) pepitoweb.altervista.org/pepit

@forteller

I think an LLM might be a bit of overkill for this.

What are you cleaning up? Sounds like poor OCR results, like you might get in an Internet Archive #ebook.

I use OCRFeeder as a front end for Tesseract, and do my own OCR rather than try to correct something done on autopilot.

Turns out to be quicker and much more accurate. I do it page-by-page and proof as I go.

I found OCRFeeder as a flatpak, but it was old and terrible. Better to compile the latest version yourself.

@demerara Thanks! But no, not OCR. It's machine-transcribed audio from interviews. For some reason Whisper puts a ton of line breaks in it and doesn't capitalize correctly. Not even "I" and "Google". And since it's interviews, there's quite a lot of repetition of words like "the" and "and" and other stuff like that.
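
For what it's worth, repeated words and a short list of known capitalizations can probably be handled without an LLM. Below is a minimal Python sketch using only the standard `re` module. The function names and the `PROPER_NOUNS` list are made up for illustration, and the list would need to be extended by hand for a real transcript.

```python
import re

# Hypothetical list of words that should always be capitalized; extend as needed.
PROPER_NOUNS = ["Google", "Whisper"]


def collapse_repeated_words(text: str) -> str:
    """Collapse immediate repetitions such as "the the" into a single word."""
    # (\w+) captures a word; (?:\s+\1\b)+ matches one or more repeats of it.
    # IGNORECASE makes "The the" count as a repetition too.
    return re.sub(r"\b(\w+)\b(?:\s+\1\b)+", r"\1", text, flags=re.IGNORECASE)


def fix_capitalization(text: str) -> str:
    """Capitalize the standalone pronoun "i" and every word in PROPER_NOUNS."""
    text = re.sub(r"\bi\b", "I", text)
    for name in PROPER_NOUNS:
        # Replace any casing of the word with the canonical spelling.
        text = re.sub(rf"\b{re.escape(name)}\b", name, text, flags=re.IGNORECASE)
    return text


if __name__ == "__main__":
    sample = "and and then i asked google about the the the thing"
    print(fix_capitalization(collapse_repeated_words(sample)))
```

One caveat: this also collapses legitimate doubles such as "had had" or "that that", and it turns "i.e." into "I.e.", so the result still needs a proofread.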

@forteller
Well, the line breaks can mostly be fixed with a regex search and replace: look for lowercase, \n, lowercase and similar patterns. Not so sure how to handle duplicate words or capitalization.
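
The regex idea for line breaks could look roughly like this in Python. It is only a sketch: the function name is made up, and it assumes that a break is unwanted when a single newline sits between a lowercase letter (or a comma) and another lowercase letter. It only knows ASCII letters, so text with characters like "ø" or "é" would need a wider character class.

```python
import re


def join_broken_lines(text: str) -> str:
    """Replace mid-sentence line breaks with a single space."""
    # Lookbehind: the break follows a lowercase letter or a comma.
    # Lookahead: the next line starts with a lowercase letter.
    # Blank lines (paragraph breaks) and breaks after ".", "?" or "!" are left alone.
    return re.sub(r"(?<=[a-z,])[ \t]*\n[ \t]*(?=[a-z])", " ", text)


if __name__ == "__main__":
    sample = "so we went\nto the store.\nThen we left"
    print(join_broken_lines(sample))
```

Breaks that come right before a capitalized word are kept, so a line ending just before a name or before "I" would still have to be joined by hand.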

@forteller @simon toots quite a bit about LLMs and running them on his own machine, I think, so his timeline and TIL and blog might provide some clues ☺️