Program for PDF -> Excel?

Youkai

Hi I need some "help".

We get lots of PDF files with invoices and such which we need to convert into Excel files ... that's not much of a problem, as you can save a PDF as Excel from within Adobe Reader, BUT sometimes it kind of rips some stuff apart. For example, we have a table saying

Money Balance
+1.003.412,43€

then sometimes it makes it look like
Money
+1.003.412,43€
and puts the Balance into another row which is kind of "bad" for what we need it for.


Right now we are using software called Monarch by Altair, which seems to do the job rather well, BUT we use a very old version and we cannot legally use it on a Terminal Server anymore. That sucks, as we are changing our whole environment to thin clients with a connection to the Terminal Server.
The new version of this program is supposed to cost A LOT, so we probably won't buy it, and I need something else.


We need software which is not web based and can be used in a TS environment. I tried Nitro, but this is actually "too accurate": there are lots of lines and such in the invoices and it adds them as well, which makes the Excel file too complicated and too much for us to use. I also tried PDFtoExcel, which destroys the file even more than Adobe does XD

Best would be a program where we could select which part of the PDF needs to be transferred ... I tried it with GTT or however it was called, which is supposed to get text from a picture, but dunno ... it didn't seem to work as I wanted it to, and I could not add PDF files, only JPEG and such.


This is pretty much what we get
Unbenannt.JPG
And this is how it looks after we used Monarch
Unbenannt2.JPG
 

FAST6191

If they maintain a common format and a common export method then there are things that can be done. If they use a non-standard system and maybe vary which PDF export method they use, that can make things tricky. (If your vendors are anything like mine, or your company is not big enough to compel them to use a particular portal, there is no chance of standardisation happening, or at least not a setup you would trust to work without an intern or similar constantly checking it over.) Someone changing from a basic PDF "printer" to an internal PDF export could change everything despite the output nominally looking the same, never mind something more fun like protecting the file or making it graphical.

PDF is pretty much a one-way street for anything beyond automated filing (which is why a lot of people like it), and even when converting by hand I still expect some manual tweaking. That said, most of my work here is for artwork, fishing text out of "print" versions, or recreating document layouts someone does not have templates for. As part of that I have to note that Inkscape has saved me on more than one occasion. Similarly, most spreadsheets do pretty well with plain old copy and paste, and let you set manual column widths as well as character-delimited columns. Depending on vendor location you might have a bit of fun with the dot and comma swap for decimals and large numbers, but that is usually sorted easily enough.
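For what it's worth, the dot and comma swap can be handled in a few lines once the amount is available as text. This is only a minimal sketch in Python: the function name `parse_german_amount` is made up for illustration, and it assumes every amount uses the German convention (dot for thousands, comma for decimals), as in the sample from the first post.

```python
from decimal import Decimal

def parse_german_amount(text: str) -> Decimal:
    """Convert an amount such as '+1.003.412,43€' into a Decimal.

    Assumes German formatting: '.' groups thousands, ',' marks decimals.
    """
    # Drop the currency sign and any spaces around or inside the amount.
    cleaned = text.strip().replace("€", "").replace(" ", "")
    # Remove the thousands separators first, then turn the decimal comma into a dot.
    cleaned = cleaned.replace(".", "").replace(",", ".")
    return Decimal(cleaned)

print(parse_german_amount("+1.003.412,43€"))
```

Amounts from vendors who use the English convention would be mangled by this, so the assumption needs checking per sender.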

I hate to suggest it as it feels so clunky and crude but some of the OCR (Texterkennung/Optische Zeichenerkennung) stuff might be more in your wheelhouse here. Might have to chain a PDF to PNG program to save hassles with whatever internal or external PDF renderer the OCR program might try to use but that should be nothing major.

Are you sure your program no longer licenses? Or if it is the move to thin clients/remote virtual machines that is causing the fun can you set up a machine to do the task? You might be moving to thin clients for day to day work but if someone can still hotseat (or even some kind of KVM) to a standalone machine you keep imaged (possibly with data on a network share) and thus compliant with existing licenses then that might be a better option if your company is not inclined to chuck several thousand Euros on network/VM compliant Monarch (assuming http://www.qbssoftware.com/products/Altair_Monarch/licensing/_prodmonarch is anything to go by).
 

zxr750j

If you're a bit handy you could use Python for this kind of thing.
I think if you use the libs PyPDF2 (to pull the text out of the PDF) and then Pandas (to shape it into tables, formatting etc.) you can get good results.
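To give a rough idea of the step after the text extraction, here is a sketch using only the Python standard library instead of Pandas. It assumes the text has already been pulled out of the PDF as a list of lines (for example by a library such as PyPDF2) and that columns are separated by two or more spaces, which will not hold for every invoice. The function names and the sample lines, including the invoice number, are made up for illustration.

```python
import csv
import io
import re

def lines_to_rows(lines):
    """Split each non-empty line into cells on runs of two or more spaces."""
    rows = []
    for line in lines:
        line = line.strip()
        if line:
            rows.append(re.split(r"\s{2,}", line))
    return rows

def rows_to_csv(rows):
    """Serialise rows as semicolon-separated text that a spreadsheet can import."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=";", lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()

# Made-up sample standing in for text extracted from an invoice.
sample = [
    "Invoice   Money Balance",
    "",
    "4711      +1.003.412,43€",
]
rows = lines_to_rows(sample)
print(rows_to_csv(rows))
```

The semicolon delimiter is chosen because the amounts themselves contain commas; a single space inside "Money Balance" is kept as part of one cell.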

Alternatively VBA in excel could also solve a lot of shit.
 

Youkai

Well yeah, we do have a portal, actually we have two XD Still, some customers tend to send us stuff like that, and they all use different sheets, which is rather annoying.

The program we use does still exist, and they do have a new version which works on Terminal Server (our old licenses won't work on TS anymore), but they want an incredibly high amount of money, probably because the program can do a lot more than we need it for ... for the money they would want for the program we could hire a student to do it manually every day and would probably still save money XD

Tried this GT Text (http://www.softocr.com/) but wasn't really happy with it.

VM might be one possible solution even though not the best and I am not sure if the license would be valid doing this ...

My boss is currently talking with the people working with Monarch and their boss to find a solution, and maybe just tell the customers they have to send us a proper Excel sheet or pay money for our portal, as this costs lots of time and money to convert -.-
 

notimp

I have some deeper knowledge of different ways to convert .pdfs into text from the time I screened maybe 20 pdf to text programs for epub creation.

Basically try the following solutions first (both Windows based, but so you know what you are dealing with).

Infix PDF Editor

Abbyy Finereader

Basically the way Adobe converts pdfs to text based formats is by relying on the 'copy/paste' text layer of the pdf. Which is line based. So text gets basically destroyed in terms of context (paragraphs, headings, linebreaks, ...) when put into pdf, and when converted to an editable file format again - the reflow needs to be 'redone'.

The best quality output I've been able to get from this (text layer), is using Infix PDF Editor, which basically does a reflow based on a pixel grid of the original document, interpreting linespacing (what is a paragraph), headings and so on and so forth.

Because it's 'pixel perfect' (actually vector positions), it only works with PDFs that were created by a digital conversion method (output format to PDF) - it doesn't work with scans.


If you are dealing with scans (think optical imperfections such as line warping), you need OCR, and the best OCR engine is, and always was, Abbyy's Finereader.
-


So basically - the errors you are describing should result from the text losing much of its internal formatting, and Adobe handling it 'line based'.
To get it back into a reflowable format, it applies some heuristics, which aren't perfect.

The heuristics in Infix PDF Editor are better by a mile.

But that won't help you if you are once in a while dealing with actual scans (imperfect line alignment, maybe no copy/paste text layer, ...) - in that case you need OCR again, and the best one is Finereader.

Any method based on the text layer alone, even a heuristics-driven one, will give worse results when converting into a reflowable format. Infix PDF Editor also looks at the pixel grid (vector positions) that each line of text is aligned to in the original document, and therefore has the best results I've seen converting from that (copy/paste) text layer, which again is line based. (It doesn't know what a paragraph is anymore. It doesn't know what a line break is anymore; both are reinterpreted during conversion.)

Which means all methods banking on Python or other tools to rearrange (reflow) the (copy/paste) text layer are inherently worse than the two programs I mentioned, which also look at positioning in the original document to try to reinterpret what a paragraph was and where multiple line breaks were, based on actual "visual" information and not just (regex-based) heuristics on the text layer.
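As a rough illustration of what "looking at positioning" means, here is a small Python sketch that groups words into table rows by their vertical position instead of by the order they appear in the text layer. It is not how Infix PDF Editor or Finereader work internally. The `(x, y, text)` word layout, the `y_tolerance` value and the sample coordinates are all assumptions made up for this example, with y measured from the top of the page.

```python
from typing import List, Tuple

# Hypothetical word record for this sketch: (x, y, text), y measured from the page top.
Word = Tuple[float, float, str]

def group_into_rows(words: List[Word], y_tolerance: float = 2.0) -> List[List[str]]:
    """Put words whose y position is within y_tolerance of a row's first word into that row."""
    rows: List[List[Word]] = []
    for word in sorted(words, key=lambda w: w[1]):
        if rows and abs(word[1] - rows[-1][0][1]) <= y_tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    # Inside each row, order the words from left to right by x.
    return [[text for _, _, text in sorted(row)] for row in rows]

# Made-up positions: "Balance" sits slightly off the baseline of "Money".
words = [
    (200.0, 100.4, "Balance"),
    (50.0, 100.0, "Money"),
    (50.0, 120.0, "+1.003.412,43€"),
]
print(group_into_rows(words))
```

With a tolerance like this, "Money" and "Balance" stay in one row even if the text layer lists them separately, which is the kind of split described in the first post.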
--

All of this is written with book digitization in mind, so I don't know exactly how it will apply to Excel tables, but the concepts should be the same.
 