plom.scan module

Plom tools associated with scanning papers

plom.scan.QRextract(image, try_harder=True)[source]

Decode and return QR codes in an image.

Parameters:
  • image (str/pathlib.Path/PIL.Image) – an image filename, either in the local dir or specified e.g., using pathlib.Path. Can also be an instance of Pillow’s Image.

  • try_harder (bool) – Try to find QRs on a smaller resolution. Defaults to True. Sometimes this seems to work around high failure rates in the synthetic images used in CI testing. Details below.

Returns:

Keys “NW”, “NE”, “SW”, “SE”, each with a dict containing ‘tpv_signature’, ‘x’ and ‘y’ keys. These correspond to the string extracted from the QR code (one string per code) and the x-y coordinates of the QR code. The dict is empty if no QR code is found in that corner.

Return type:

dict/None

Without the try_harder flag, we observe high failure rates when the vertical resolution is near 2000 pixels (our current default). This is Issue #967 [1]. It is not prevalent in real-life images, but causes a roughly 5%-10% failure rate in our synthetic CI runs. The workaround (on by default) uses Pillow’s .reduce() to quickly downscale the image. This does increase the run time (not yet measured; the guess is between 25% and 50%), so if run time is more of a concern than error rate, turn off this flag.

[1] https://gitlab.com/plom/plom/-/issues/967
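As a minimal sketch of consuming the documented return value, the helper below lists the corners in which a code was found. The helper name corners_found and the sample values are hypothetical and exist only for illustration; only the key layout (“NW”, “NE”, “SW”, “SE”, each holding a dict that is either empty or has ‘tpv_signature’, ‘x’, ‘y’) is taken from the description above.

```python
def corners_found(qrs):
    """Return the corner names whose dict is non-empty, in a fixed order."""
    if qrs is None:  # the documented return type is dict/None
        return []
    return [corner for corner in ("NW", "NE", "SW", "SE") if qrs.get(corner)]


# Hypothetical sample shaped like the documented return value;
# the strings and coordinates are placeholders, not real Plom data.
sample = {
    "NW": {},
    "NE": {"tpv_signature": "placeholder-NE", "x": 100.0, "y": 50.0},
    "SW": {"tpv_signature": "placeholder-SW", "x": 60.0, "y": 900.0},
    "SE": {},
}

print(corners_found(sample))  # ['NE', 'SW']
```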

plom.scan.QRextract_legacy(image, write_to_file=True, try_harder=True)[source]

Decode QR codes in an image, return or save them in .qr file.

Parameters:
  • image (str/pathlib.Path/PIL.Image) – an image filename, either in the local dir or specified e.g., using pathlib.Path. Can also be an instance of Pillow’s Image.

  • write_to_file (bool) – by default, the results are written into a file named image.qr (i.e., the same as input name with .qr appended, so something like foo.jpg.qr). If this .qr file already exists and is non-empty, then no action is taken, and None is returned.

  • try_harder (bool) – Try to find QRs on a smaller resolution. Defaults to True. Sometimes this seems to work around high failure rates in the synthetic images used in CI testing. Details below.

Returns:

Keys “NW”, “NE”, “SW”, “SE”, each with a list of the strings extracted from QR codes, one string per code. The list is empty if no QR codes are found in that corner.

Return type:

dict/None

Without the try_harder flag, we observe high failure rates when the vertical resolution is near 2000 pixels (our current default). This is Issue #967 [1]. It is not prevalent in real-life images, but causes a roughly 5%-10% failure rate in our synthetic CI runs. The workaround (on by default) uses Pillow’s .reduce() to quickly downscale the image. This does increase the run time (not yet measured; the guess is between 25% and 50%), so if run time is more of a concern than error rate, turn off this flag.

[1] https://gitlab.com/plom/plom/-/issues/967
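The downscaling workaround might look roughly like the sketch below. This is not Plom's implementation: decode is a hypothetical callable standing in for whatever QR decoder is used, the function name and the factor of 2 are made up, and merging the results of both passes is an assumption (the text above only says that a smaller resolution is also tried). The image argument only needs a Pillow-style reduce(factor) method.

```python
def decode_with_fallback(image, decode, try_harder=True, factor=2):
    """Decode at full resolution, then optionally on a downscaled copy.

    ``decode`` is a hypothetical callable: image -> list of strings.
    ``image`` must offer ``reduce(factor)`` as Pillow's Image does.
    """
    found = list(decode(image))
    if try_harder:
        # Assumption: also decode the reduced image and keep any new strings.
        for text in decode(image.reduce(factor)):
            if text not in found:
                found.append(text)
    return found
```

The second decode pass is where the extra run time mentioned above would come from, which is why the flag can be switched off.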

plom.scan.processFileToBitmaps(file_name, dest, *, do_not_extract=False, debug_jpeg=False, add_metadata=True)[source]

Extract/convert each page of pdf into bitmap.

We have various ways to do this, in rough order of preference:

  1. Extract a scanned bitmap “as-is”

  2. Render the page with PyMuPDF

  3. Render the page with Ghostscript

The bitmaps will have some metadata written into them to prevent otherwise identical pages from producing images with identical hashes. See Issue #1573.

Parameters:
  • file_name (str, Path) – PDF file from which to extract bitmaps.

  • dest (str, Path) – where to save the resulting bitmap files. Must exist.

Keyword Arguments:
  • do_not_extract (bool) – always render, do not extract even if it seems possible to do so. This is off-by-default until we are confident extracting won’t miss anything. See more detailed description in the user-facing command-line tool plom-scan.

  • debug_jpeg (bool) – make JPEGs, randomly rotated and of various quality settings, for debugging or demos. Default: False.

  • add_metadata (bool) – add invisible metadata to each image including bundle name and random numbers. Default: True. If you disable this, you can get two identical images (from different pages) giving identical hashes, which in theory is harmless but at least in 2022 was causing database/client issues.

Returns:

an ordered list of the images of each page. Each entry is a pathlib.Path.

Return type:

list

Raises:
  • RuntimeError – not a PDF and not something PyMuPDF can open.

  • TypeError – not a PDF, but it can be opened by PyMuPDF.

  • ValueError – unrealistically tall skinny or very wide pages.

For extracting the scanned data as is, we must be careful not to just grab any image off the page (for example, it must be the only image on the page, and it must not have any annotations on top of it). There are various other conditions; if any of them fail, we fall back on rendering with PyMuPDF.

If the above fail, we fall back on calling Ghostscript as a subprocess (the gs binary). TODO: NOT IMPLEMENTED YET.
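The order of preference described above can be sketched as follows. The names page_to_bitmap, extract and render are hypothetical; extract is assumed to return a (path-or-None, messages) pair in the style documented for try_to_extract_image, and render is assumed to return a path. The Ghostscript step is left out because it is not implemented.

```python
from pathlib import Path


def page_to_bitmap(page, extract, render, do_not_extract=False):
    """Prefer extracting the scanned bitmap; otherwise render the page.

    ``extract`` and ``render`` are hypothetical callables supplied by the
    caller: extract(page) -> (Path or None, list of str), render(page) -> Path.
    """
    msgs = []
    if not do_not_extract:
        path, msgs = extract(page)
        if path is not None:
            return path, msgs  # extraction succeeded, nothing more to do
    # Extraction was skipped or declined: fall back on rendering.
    return render(page), msgs
```

Returning the messages alongside the path keeps the reason for any fallback available to the caller, for example for logging.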

plom.scan.processHWScans(pdf_fname, student_id, questions, *, msgr, gamma=False, extractbmp=False, basedir=PosixPath('.'), bundle_name=None)[source]

Process the given PDF bundle into images, upload, then archive the pdf.

Parameters:
  • pdf_fname (pathlib.Path/str) – path to a PDF file. Need not be in the current working directory.

  • student_id (str) –

  • questions (list) –

    to which questions should we upload these pages?

    • a scalar number: all pages map to this question.

    • a list of integers: all pages map to those questions.

    • the string “all” maps each page to all questions.

    • a list-of-lists specifying which questions each page maps onto, e.g., [[1],[1,2],[2]] maps page 1 onto question 1, page 2 onto questions 1 and 2, and page 3 onto question 2.

    Any string input will be parsed to find the above options. Tuples or other iterables should work in place of lists. TODO: Currently dicts are not supported, subject to change.
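One way to read the accepted forms of questions is as shorthand for a list-of-lists with one entry per page. The sketch below expands each form under that reading. The function name, the num_pages and num_questions arguments, the 1-based question numbering and the error on a length mismatch are all assumptions for illustration, and parsing of other strings is deliberately left out.

```python
def canonicalize_questions(questions, num_pages, num_questions):
    """Expand a question spec into one list of questions per page (a sketch)."""
    if isinstance(questions, str):
        if questions == "all":
            # Assumption: questions are numbered 1..num_questions.
            return [list(range(1, num_questions + 1)) for _ in range(num_pages)]
        raise ValueError("string parsing is not covered by this sketch")
    if isinstance(questions, int):
        # A scalar: every page maps to this one question.
        return [[questions] for _ in range(num_pages)]
    questions = list(questions)  # accept tuples and other iterables
    if all(isinstance(q, int) for q in questions):
        # A flat list: every page maps to all of the listed questions.
        return [list(questions) for _ in range(num_pages)]
    # Otherwise treat it as a list-of-lists, one inner list per page.
    per_page = [list(qs) for qs in questions]
    if len(per_page) != num_pages:
        raise ValueError("need exactly one list of questions per page")
    return per_page
```

For instance, canonicalize_questions([[1], [1, 2], [2]], 3, 2) returns the input unchanged, matching the three-page example above.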

Keyword Arguments:
  • msgr (plom.Messenger/tuple) – either a connected Messenger or a tuple appropriate for credentials.

  • basedir (pathlib.Path) – where on the file system do we perform the work. By default, the current working directory is used. Subdirectories “archivePDFs” and “bundles” will be created.

  • bundle_name (str/None) – Override the bundle name (which by default is generated from the PDF filename).

  • gamma (bool) –

  • extractbmp (bool) –

Returns:

None

Raises:
  • ValueError – various errors such as cannot find file, no such student id, md5sum collision with existing bundle, etc. Generally things caller could fix. Check message for details.

  • RuntimeError – expected failing conditions.

  • TODO – possibly others, need to drill into process_scans and other methods.

Ask server to map student_id to a test-number; these should have been pre-populated on test-generation so if student_id not known there is an error.

Turn pdf_fname into a bundle name and check with the server whether that bundle_name / md5sum is known.

  • abort if name xor md5sum known,

  • continue otherwise (when both name / md5sum known we assume this is resuming after a crash).

Process PDF into images.

Ask server to create the bundle, which tells us the skip_list which is a list of bundle-orders (i.e., page number within the PDF) that have already been uploaded. In typical use this will be empty.

Then upload pages to the server if not in skip list. Finally archive the bundle.
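The name / md5sum decision described above can be written down compactly. This is a sketch only: the function name and the returned labels are made up, and the two booleans stand in for whatever the server reports.

```python
def bundle_action(name_known, md5_known):
    """Decide how to proceed given what the server knows about a bundle."""
    if name_known != md5_known:
        # Exactly one of the two is known (xor): refuse to continue.
        return "abort"
    if name_known and md5_known:
        # Both known: assumed to be resuming after a crash.
        return "resume"
    # Neither known: a fresh bundle.
    return "new"
```

Both "resume" and "new" correspond to the “continue otherwise” branch; they are separated here only to make the crash-recovery case visible.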

plom.scan.processMissing(*, msgr, yes_flag)[source]

Replace missing questions with ‘not submitted’ pages

Students may not upload pages for questions they don’t answer. This function asks the server for a list of all missing hw-questions from all tests that have been used (but are not complete). The server only returns questions that have neither hw-pages nor t-pages, so any partially scanned tests with t-pages are avoided.

For each remaining test we replace each missing question with a ‘question not submitted’ page. The user will be prompted in each case unless the ‘yes_flag’ is set.

plom.scan.processScans(pdf_fname, *, msgr, gamma=False, extractbmp=False, demo=False)[source]

Process PDF file into images and read QRcodes

Parameters:

pdf_fname (pathlib.Path/str) – path to a PDF file. Need not be in the current working directory.

Keyword Arguments:
  • msgr (plom.Messenger/tuple) – either a connected Messenger or a tuple appropriate for credentials.

  • bundle_name (str/None) – Override the bundle name (which by default is generated from the PDF filename).

  • gamma (bool) –

  • extractbmp (bool) –

  • demo (bool) – do things appropriate for a demo such as lower quality or various simulated rotations.

Returns:

None

Convert the file into a bundle name, then check with the server whether the bundle name / md5sum is already on the server:

  • abort if name xor md5sum known,

  • continue if neither known or both known.

Make the required directories for processing the bundle, convert the PDF to images and read QR codes from those.

plom.scan.render_page_to_bitmap(p, dest, basename, bundle_name, debug_jpeg=False, add_metadata=True)[source]

Use PyMuPDF to render a PDF page to an image.

Parameters:
  • p (fitz.Page) –

  • dest (pathlib.Path) – where to save the resulting bitmap file.

  • basename (str) –

  • bundle_name (str/pathlib.Path) – only used for metadata hackery uniqifying pages, you can pass whatever you want.

Keyword Arguments:

add_metadata (bool) – add invisible metadata to each image including bundle name and random numbers. Default: True. If you disable this, you can get two identical images (from different pages) giving identical hashes, which in theory is harmless but at least in 2022 was causing database/client issues.

Returns:

the rendered image on disc.

Return type:

pathlib.Path

Raises:

ValueError – overly weird shapes such as too tall (“Safeway receipt”) or too wide (“fortune cookie”).

plom.scan.rotate_bitmap(fname, angle, *, clockwise=False)[source]

Rotate bitmap counterclockwise, possibly in metadata.

Parameters:
  • fname (pathlib.Path/str) – name of a file

  • angle (int) – CCW angle of rotation: 0, 90, 180, 270, or -90.

Keyword Arguments:

clockwise (bool) – By default this is False and we do anti-clockwise (“counter-clockwise”) rotations. Pass True if you want +90 to be a clockwise rotation instead.

If it’s a JPEG, we have special handling; otherwise, we use the Python library PIL to open, rotate and then resave the image, replacing the original.
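The angle convention can be illustrated with a small helper that reduces any accepted input to an equivalent counterclockwise angle. The helper name and the decision to reject other angles with ValueError are assumptions for this sketch; it does not touch any image.

```python
def effective_ccw_angle(angle, clockwise=False):
    """Map an accepted angle to the equivalent CCW angle in 0, 90, 180, 270."""
    if angle not in (0, 90, 180, 270, -90):
        raise ValueError(f"unsupported angle: {angle}")
    if clockwise:
        # A clockwise rotation by `angle` is a CCW rotation by `-angle`.
        angle = -angle
    return angle % 360


print(effective_ccw_angle(-90))                 # 270
print(effective_ccw_angle(90, clockwise=True))  # 270
```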

plom.scan.try_to_extract_image(p, doc, dest, basename, bundle_name, *, do_not_extract=False, add_metadata=True)[source]

If possible/desirable, extract an image from a PDF page and save to disc.

“Desirable” means there are no additional markings on the page; no information will be lost by looking only at the extracted image instead of the original page.

Parameters:
  • p (fitz.Page) –

  • doc (fitz.Document) –

  • dest (pathlib.Path) – where to save the resulting bitmap file.

  • basename (str) –

  • bundle_name (str/pathlib.Path) – only used for metadata hackery uniqifying pages, you can pass whatever you want.

Keyword Arguments:
  • do_not_extract (bool) – always render, do not extract even if it seems possible to do so. This is off-by-default until we are confident extracting won’t miss anything. See more detailed description in the user-facing command-line tool plom-scan.

  • add_metadata (bool) – add invisible metadata to each image including bundle name and random numbers. Default: True. If you disable this, you can get two identical images (from different pages) giving identical hashes, which in theory is harmless but at least in 2022 was causing database/client issues.

Returns:

The first entry is a pathlib.Path or None: a Path means we have extracted the image, whereas None means we could not (or chose not to) extract. The second entry, msgs, is a list of strings giving semi-user-readable info about why we cannot (or choose not to) extract.

Return type:

2-tuple

plom.scan.uploadImages(bundle_name, *, msgr, do_unknowns=False, do_collisions=False, prompt=True)[source]

Upload processed images from bundle.

Parameters:

bundle_name (str) – usually the PDF filename but in general whatever string was used to define a bundle.

Keyword Arguments:
  • msgr (plom.Messenger/tuple) – either a connected Messenger or a tuple appropriate for credentials.

  • do_unknowns (bool) –

  • do_collisions (bool) –

  • prompt (bool) – ok to interactively prompt (default: True).

Returns:

None

Try to create a bundle on the server:

  • abort if name xor md5sum of bundle known,

  • continue otherwise (server will give skip-list).

Skip images whose page within the bundle is in the skip-list since those are already uploaded. Once uploaded, archive the bundle PDF.

As part of the upload ‘unknown’ pages and ‘collisions’ may be detected. These will not be uploaded unless the appropriate flags are set.
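Filtering by the skip-list can be sketched as below. The function name is hypothetical, and the sketch assumes that bundle-orders are 1-based page numbers and that image_paths is ordered by page; neither detail is confirmed here.

```python
def pages_to_upload(image_paths, skip_list):
    """Pair each image with its bundle-order and drop those already uploaded."""
    skip = set(skip_list)
    return [
        (order, path)
        # Assumption: bundle-order counts pages from 1.
        for order, path in enumerate(image_paths, start=1)
        if order not in skip
    ]


print(pages_to_upload(["p1.png", "p2.png", "p3.png"], [2]))
# [(1, 'p1.png'), (3, 'p3.png')]
```

With an empty skip-list, which is the typical case, every page is returned.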