Program for PDF -> Excel?

Youkai

Hi, I need some "help".

We get lots of PDF files with invoices and such which we need to turn into Excel files. That's not much of a problem, as you can save a PDF as Excel from within Adobe Reader, BUT sometimes it kind of rips some things apart. For example, we have a table saying

Money Balance
+1.003.412,43€

then sometimes it turns it into
Money
+1.003.412,43€
and puts the Balance into another row, which is kind of "bad" for what we need it for.


Right now we are using a software called Monarch by Altair, which seems to do the job rather well, BUT we use a very old version and we can no longer legally use it on a Terminal Server. That sucks, as we are moving our whole environment over to thin clients connected to the Terminal Server.
The new version of this program is supposed to cost A LOT, so we probably won't buy it and I need something else.


We need software which is not web-based and can be used in a TS environment. I tried Nitro, but it is actually "too accurate": there are lots of lines and such in the invoices and it adds them as well, which makes the Excel file too complicated and too much for us to use. I also tried PDFtoExcel, which destroys the file even more than Adobe does XD

Best would be a program where we could select which part of the PDF needs to be transferred. I tried it with GTT or whatever it was called, which is supposed to get text from a picture, but I don't know... it didn't seem to work the way I wanted, and I could not add PDF files, only JPEG and such.


This is pretty much what we get
Unbenannt.JPG
And this is how it looks after we used Monarch
Unbenannt2.JPG
 

FAST6191

If they maintain a common formatting and a common export method then there are things that can be done. If they are going to use a non-standard system, and maybe vary which PDF export method they use, then that can make things tricky: someone changes from a basic PDF "printer" to an internal PDF export and it could change everything despite nominally appearing the same, never mind something more fun like protecting the file or making it graphical. (If your vendors and such are anything like mine, or your company is not big enough to compel them to use a particular portal, then there is no chance of standardisation happening, or at least not one you would be confident in without having an intern or someone constantly scan over the results.) PDF is pretty much a one-way street as far as anything more than automated filing goes (it is why a lot of people like it), and even when going manual I still expect some manual tweaking.

That said, most of my work here is for artwork, fishing out text from "print" versions, or recreating document layouts someone does not have templates for. As part of that I have to note that Inkscape has saved me on more than one occasion. Similarly, most spreadsheets do pretty well with plain old copy and paste, and let you set manual column widths as well as character-delimited columns. Depending upon vendor location you might have a bit of fun with the dot and comma swap for decimals and thousands separators, but that is usually sorted easily enough.
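The dot and comma swap can be handled in a few lines once the text is out of the PDF. This is a minimal Python sketch, assuming amounts look like the `+1.003.412,43€` example from the first post; the function name `parse_amount` and its parameters are made up for illustration.

```python
from decimal import Decimal


def parse_amount(text: str, decimal_sep: str = ",", thousands_sep: str = ".") -> Decimal:
    """Turn a string such as '+1.003.412,43€' into Decimal('1003412.43')."""
    # Drop surrounding whitespace (including non-breaking spaces) and a trailing euro sign.
    cleaned = text.replace("\u00a0", " ").strip().rstrip("€").strip()
    # Remove the grouping separator first, then normalise the decimal separator to a dot.
    cleaned = cleaned.replace(thousands_sep, "").replace(decimal_sep, ".")
    return Decimal(cleaned)
```

For a vendor using the English convention, swapping the two separators (`decimal_sep="."`, `thousands_sep=","`) should be enough. `Decimal` is used instead of `float` so that invoice amounts are not subject to binary rounding.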

I hate to suggest it as it feels so clunky and crude, but some of the OCR (optical character recognition; Texterkennung/Optische Zeichenerkennung) stuff might be more in your wheelhouse here. You might have to chain a PDF-to-PNG program in front of it to save hassles with whatever internal or external PDF renderer the OCR program might try to use, but that should be nothing major.
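A rough sketch of such a chain in Python, run locally rather than through a web service. The tool choice is an assumption, not something named above: it uses Poppler's `pdftoppm` to render pages to PNG and the `tesseract` command line for the OCR step, and both have to be installed and on the PATH (plus the German `deu` language data). All function names are hypothetical.

```python
import subprocess
from pathlib import Path


def render_cmd(pdf: str, prefix: str, dpi: int = 300) -> list[str]:
    """Command that renders every page of `pdf` to <prefix>-<page>.png."""
    return ["pdftoppm", "-png", "-r", str(dpi), pdf, prefix]


def ocr_cmd(png: str, out_base: str, lang: str = "deu") -> list[str]:
    """Command that runs OCR on one image and writes <out_base>.txt."""
    return ["tesseract", png, out_base, "-l", lang]


def pdf_to_text(pdf: str, workdir: str, lang: str = "deu") -> str:
    """Render the PDF to PNGs, OCR each page, and return the combined text."""
    work = Path(workdir)
    work.mkdir(parents=True, exist_ok=True)
    subprocess.run(render_cmd(pdf, str(work / "page")), check=True)
    texts = []
    # pdftoppm zero-pads page numbers, so a plain sort keeps page order.
    for png in sorted(work.glob("page-*.png")):
        out_base = png.with_suffix("")
        subprocess.run(ocr_cmd(str(png), str(out_base), lang), check=True)
        texts.append(out_base.with_suffix(".txt").read_text(encoding="utf-8"))
    return "\n".join(texts)
```

Rendering at 300 dpi is a common starting point for OCR; the output is plain text, so the table structure still has to be rebuilt afterwards.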

Are you sure your program no longer licenses? Or, if it is the move to thin clients/remote virtual machines that is causing the fun, can you set up one machine to do the task? You might be moving to thin clients for day-to-day work, but if someone can still hot-seat at (or use some kind of KVM to reach) a standalone machine that you keep imaged, possibly with the data on a network share, and that is thus compliant with your existing licenses, then that might be a better option if your company is not inclined to chuck several thousand euros at a network/VM-compliant Monarch (assuming http://www.qbssoftware.com/products/Altair_Monarch/licensing/_prodmonarch is anything to go by).
 

zxr750j

If you're a bit handy you could use Python for these kinds of things.
I think if you use the libs PyPDF2 (to get the text out of the PDF) and then Pandas (to shape it into tables, formatting etc.) you can get good results.

Alternatively VBA in excel could also solve a lot of shit.
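A minimal sketch of the Python route, under some loud assumptions: the PDFs have a real text layer (not scans), columns in the extracted text are separated by two or more spaces (this depends on the PDF and may well not hold), and PyPDF2, pandas and openpyxl are installed. The function names are invented for the example.

```python
import re


def extract_lines(pdf_path):
    """Return the text layer of every page as a flat list of lines."""
    from PyPDF2 import PdfReader  # third-party: pip install PyPDF2

    lines = []
    for page in PdfReader(pdf_path).pages:
        lines.extend((page.extract_text() or "").splitlines())
    return lines


def split_columns(line):
    """Split one text line into cells wherever two or more whitespace chars occur."""
    return [cell for cell in re.split(r"\s{2,}", line.strip()) if cell]


def lines_to_excel(lines, xlsx_path):
    """Write the non-empty lines to an .xlsx file, one row per line."""
    import pandas as pd  # third-party: pip install pandas openpyxl

    rows = [split_columns(line) for line in lines if line.strip()]
    # Rows of different lengths are padded with empty cells by pandas.
    pd.DataFrame(rows).to_excel(xlsx_path, index=False, header=False)
```

Usage would be something like `lines_to_excel(extract_lines("invoice.pdf"), "invoice.xlsx")`, where both file names are placeholders. The column rule is the fragile part and would likely need adjusting per invoice layout.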
 

Youkai

Well yeah, we do have a portal, actually we have two XD Still, some customers tend to send us stuff like that, and they use different sheets, which is rather annoying.

The program we use does still exist and there is a new version that works on Terminal Server (our old licenses won't work on TS anymore), but they want an incredibly high amount of money, as the program can probably do a lot more than we need it for... For the money they want for the program we could hire a student to do it manually every day and would probably still save money XD

Tried this GT Text (http://www.softocr.com/) but wasn't really happy with it.

A VM might be one possible solution, even though not the best, and I am not sure if the license would be valid doing this...

My boss is currently talking with the people who work with Monarch and their boss to find a solution, and maybe we will just tell the customers they have to send us a proper Excel sheet or pay for our portal, as converting this costs lots of time and money -.-
 

notimp

I have some deeper knowledge of the different ways to convert PDFs into text from the time I screened maybe 20 PDF-to-text programs for EPUB creation.

Basically, try the following solutions first (both are Windows-based, just so you know what you are dealing with).

Infix PDF Editor

Abbyy Finereader

Basically, the way Adobe converts PDFs to text-based formats is by relying on the 'copy/paste' text layer of the PDF, which is line based. So the text basically loses its context (paragraphs, headings, line breaks, ...) when it is put into a PDF, and when it is converted to an editable file format again the reflow needs to be 'redone'.

The best quality output I've been able to get from this text layer is with Infix PDF Editor, which basically does a reflow based on a pixel grid of the original document, interpreting line spacing (what is a paragraph), headings and so on.

Because it's 'pixel perfect' (actually vector positions) it only works with PDFs that were created by a digital conversion method (exporting from the source format to PDF). It doesn't work with scans.


If you are dealing with scans (think optical imperfections such as line warping), you need OCR, and the best OCR engine is, and always was, Abbyy's Finereader.
-


So basically, the errors you are describing should result from the text losing much of its internal formatting and Adobe handling it 'line based'.
To get it back into a reflowable format, it applies some heuristics, which aren't perfect.

The heuristics in Infix PDF Editor are better by a mile.

But that won't help you if you are once in a while dealing with actual scans (imperfect line alignment, maybe no copy/paste text layer, ...). In that case you need OCR again, and the best one is Finereader.

Any method based on the text layer alone, even a heuristics-driven one, will lead to worse results when converting into a reflowable format. Infix PDF Editor also looks at the pixel grid (vector positions) that each line of text is aligned to in the original document, and therefore gives the best results I've seen when converting from that (copy/paste) text layer into a reflowable format. That text layer, again, is line based: it doesn't know what a paragraph is anymore and it doesn't know what a line break is anymore (they are reinterpreted during conversion).

Which means all methods that bank on Python or similar to rearrange (reflow) the (copy/paste) text layer are inherently worse than the two programs I mentioned, which also look at the positioning in the original document to try to reinterpret what a paragraph was and where multiple line breaks were, based on actual "visual" information and not just (regex-based) heuristics on the text layer.
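To illustrate what using positions instead of the bare text layer means for a table, here is a minimal Python sketch. It is not how Infix or Finereader work internally. The input format (word boxes as `(x0, x1, y, text)` with y measured from the top of the page), the tolerance values, and both function names are assumptions for illustration only.

```python
def group_rows(words, y_tol=2.0):
    """Group word boxes (x0, x1, y, text) into rows by similar y position."""
    rows = []
    for word in sorted(words, key=lambda w: (w[2], w[0])):
        # Same row if the y position is close to the row's first word.
        if rows and abs(rows[-1][0][2] - word[2]) <= y_tol:
            rows[-1].append(word)
        else:
            rows.append([word])
    # Within each row, order the words left to right.
    return [sorted(row, key=lambda w: w[0]) for row in rows]


def row_to_cells(row, gap=15.0):
    """Join neighbouring words; start a new cell at a wide horizontal gap."""
    cells, current, last_x1 = [], [], None
    for x0, x1, _y, text in row:
        if last_x1 is not None and x0 - last_x1 > gap:
            cells.append(" ".join(current))
            current = []
        current.append(text)
        last_x1 = x1
    if current:
        cells.append(" ".join(current))
    return cells
```

With boxes for "Money" and "Balance" sitting close together on one line and the amount far to the right, the two words stay in one cell because the decision is made on distances, not on how the text layer happened to be split.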
--

All of this is written with book digitization in mind, so I don't know exactly how it will apply to Excel tables, but the concepts should be the same.
 