Friday, May 29, 2020

Converting a PDF file to Wiki format

sudo apt install pandoc
sudo apt-get install poppler-utils
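
Before going further, it can help to confirm that the tools are actually on your PATH. The following is only a small sketch; missing_tools is a hypothetical helper name chosen for illustration, not something shipped with pandoc or poppler-utils.

```shell
# Hypothetical helper: print the names of commands that are not on PATH.
# Returns 0 when everything is present, 1 otherwise.
missing_tools() {
  local tool missing=""
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
  done
  [ -z "$missing" ] && return 0
  # Strip the leading space before printing the list.
  printf '%s\n' "${missing# }"
  return 1
}

# Usage:
#   missing_tools pandoc pdftohtml pdftotext || echo "install the tools listed above"
```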

PDF --> HTML
mkdir kimenet
pdftohtml -s -p -fmt png -nodrm "file.pdf" "kimenet/file.html"

Run pdftohtml -h to see all available parameters.
The parameters used in the command above are:
  • -s writes all pages into a single HTML document (excluding the outline).
  • -p attempts to replace the PDF's internal links with HTML links.
  • -fmt sets the output format of images; png and jpg are the valid options.
  • -nodrm ignores digital rights management (DRM) restrictions on the PDF.
  • -i ignores images. I didn't use this, but it is worth mentioning because in some cases it may greatly speed up the conversion.
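
If you have several PDFs, the same flags can be wrapped in a small function. This is a sketch under assumptions: html_path_for and pdf_to_html are hypothetical names, the default output directory kimenet is only an example, and the exact file names pdftohtml writes may differ between versions.

```shell
# Hypothetical helper: build the HTML path for a PDF,
# e.g. docs/report.pdf with directory "out" gives out/report.html.
html_path_for() {
  local pdf="$1" outdir="${2:-kimenet}"
  printf '%s/%s.html\n' "$outdir" "$(basename "$pdf" .pdf)"
}

# Hypothetical helper: convert one PDF with the flags explained above.
pdf_to_html() {
  local pdf="$1" outdir="${2:-kimenet}"
  mkdir -p "$outdir" || return 1
  pdftohtml -s -p -fmt png -nodrm "$pdf" "$(html_path_for "$pdf" "$outdir")"
}

# Usage:
#   for f in *.pdf; do pdf_to_html "$f" kimenet; done
```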

Alternative method: Poppler pdftotext

pdftotext -htmlmeta "file.pdf" "file.html"

 Replace "file.pdf" with the name of the PDF you want to parse and "file.html" with the name of the HTML file you want to write the text output to.
 The `-htmlmeta` option creates an HTML version of the text in your PDF. (This is much less fancy than the previous command and only puts the text in `pre` tags.) You should see an HTML file in your directory, which you can open to check the results. Depending on the formatting of your source PDF file, you may find that Poppler varies in its effectiveness. You can run `pdftotext -h` for information on other command options that may improve or worsen your results.
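
Instead of opening the file by hand, a quick check can be scripted. The sketch below assumes only what is described above, namely that the output is a non-empty file containing a `pre` tag; text_to_pre_html is a hypothetical name, and the check says nothing about the quality of the extracted text.

```shell
# Hypothetical helper: run pdftotext, then verify the output file
# is non-empty and contains a <pre> tag.
text_to_pre_html() {
  local pdf="$1" out="$2"
  pdftotext -htmlmeta "$pdf" "$out" || return 1
  [ -s "$out" ] && grep -q '<pre>' "$out"
}

# Usage:
#   text_to_pre_html file.pdf file.html || echo "conversion produced no usable HTML"
```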

Pandoc: HTML --> MediaWiki

 pandoc file.html -f html -t mediawiki -s -o file.txt
  • -f The input format.
  • -t The output format.
  • -s Standalone adds a header and footer to the document, rather than producing a document fragment.
  • -o The name of the output file.
See the Pandoc user guide for the full list of options.
It is possible you may run into an error with Pandoc, presumably caused by your file being too large. I ran into this error and some fixes can be found here.
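
When the previous step leaves several HTML files, the same Pandoc call can be looped over a directory. This is a minimal sketch; html_dir_to_wiki is a hypothetical name, and writing each result next to its source with a .txt extension is just one possible convention.

```shell
# Hypothetical helper: convert every .html file in a directory to
# MediaWiki markup, writing NAME.txt next to NAME.html.
html_dir_to_wiki() {
  local dir="$1" html
  for html in "$dir"/*.html; do
    # If nothing matches, the glob stays literal; skip it.
    [ -e "$html" ] || continue
    pandoc "$html" -f html -t mediawiki -s -o "${html%.html}.txt" || return 1
  done
}

# Usage:
#   html_dir_to_wiki kimenet
```

Stopping at the first failure (return 1) makes it easier to see which file triggered a Pandoc error.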

Optional: cleaning up bad encoding

Depending on your PDF's encoding, you may find strange Unicode characters in your HTML output. This step is intended to clean up that output as accurately as possible. ftfy stands for "fixes text for you"; it is a Python library with a command-line interface. We'll use the command line to clean our files. This step is performed before using Pandoc.

Installing ftfy:
git clone https://github.com/LuminosoInsight/python-ftfy.git
cd python-ftfy
sudo python setup.py install
Or, if your system has pip, run pip install ftfy. Note that if you want to use version 5.0 (the most recent available at the time of writing) or later, you need Python 3. I used Python 2.x with ftfy 4.1.1 for this answer. In the same directory as your HTML file, type the following command:
 ftfy -o file_clean.html --preserve-entities file.html
Optionally, you may include the --guess option to have ftfy guess your encoding, or --encoding if you know your encoding. This may produce better results.
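
Since the cleanup has to run between the HTML conversion and Pandoc, the order of the steps can be captured in one function. This is a sketch, not a definitive script: pdf_to_wiki is a hypothetical name, the _clean.html suffix simply mirrors the command above, and the pdftotext variant is used here because its output file name is the one you pass in.

```shell
# Hypothetical helper: PDF -> HTML -> cleaned HTML -> MediaWiki.
# Arguments: input PDF, intermediate HTML path, final wiki text path.
pdf_to_wiki() {
  local pdf="$1" html="$2" wiki="$3"
  local clean="${html%.html}_clean.html"
  # 1. Extract the text as simple HTML.
  pdftotext -htmlmeta "$pdf" "$html" || return 1
  # 2. Repair encoding problems before Pandoc sees the file.
  ftfy -o "$clean" --preserve-entities "$html" || return 1
  # 3. Convert the cleaned HTML to MediaWiki markup.
  pandoc "$clean" -f html -t mediawiki -s -o "$wiki"
}

# Usage:
#   pdf_to_wiki file.pdf file.html file.txt
```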
