In recent years, Portable Document Format (PDF) has gained wide popularity as a standard for distributing printable documents and presentations. Manuals for computer hardware and cars are distributed as PDF files, and Internet standards and standard proposals are now offered as portable documents. Even search engines have started to process and index PDF files. Among other uses, PDF is becoming a standard for reporting and form filling.
Of course, not all documents published in PDF format are released for public use. Privately distributed documents may contain commercial secrets, personal information or other data that should not be disclosed. Even public documents may be intended for limited use only. A publisher may decide to let the user read the document but not print it. Other limitations of use are also possible. This problem is called Digital Rights Management. The PDF format defines certain techniques that are meant to enforce such limitations. Unfortunately, most of them can be circumvented with third-party software, and no suitable technical solution has been found so far.
Another common task is proof of authenticity. The publisher often needs to confirm that the document was really published by them. This is done by electronically signing the document. The user of the document can later check the signature and make sure that the document was published by a certain publisher and that it has not been tampered with since publication. The PDF specification defines a certificate-based signing scheme and lets third parties extend the specification with more schemes. In reporting and form-filling activities, the electronic signature also becomes an integral part of the document being created.
One great benefit (and, at the same time, a certain disadvantage) is that PDF security is built into the specification. In other words, an encrypted or signed document is still a PDF document and can be opened with PDF management software without extra steps. Of course, this software needs to decrypt the document if it is encrypted, but in all cases it can reach the metadata (information about the document) and the document structure. In a signed document, a person can fill in the forms without breaking the integrity of the document. This benefit is not available with other document formats so far.
Standard encryption measures (symmetric encryption)
Currently, most PDF publishers use the built-in symmetric encryption of the document. This encryption is supported by Acrobat software and by many other solutions for creating and viewing PDF files.
Symmetric encryption is often called password-based. With symmetric encryption, the publisher and the intended users of the document must all know some secret key (a password or passphrase). This key is used both to protect the document and to access the protected data. Once the document has been protected (encrypted), the secret key is required to get access to its contents. The scheme is easy to use, since the secret key is usually a piece of text. The disadvantage is also significant: the secret key (password) must be passed from the publisher to the intended users. Also, the password is the same for every user, so when documents are published on a regular basis and some user must be excluded from the list, the password must be changed and the remaining users must be notified of the change. Managing passwords in this case can become a nightmare.
Another disadvantage of password-based encryption is that passwords chosen by people are often weak. They can be guessed or discovered with a dictionary attack (where a dictionary of words is tried) or a brute-force attack (where candidate passwords are constructed one by one and tried on the document being attacked). Detailed analysis of attacks on password-based encryption is beyond the scope of this article.
The PDF specification defines the use of the RC4 and AES symmetric algorithms to encrypt the data. For RC4, the length of the encryption key (which is usually derived from the password) is 40 or 128 bits. For AES, the length of the encryption key is 128 bits. 40-bit keys are very weak and do not provide the desired level of protection at all. However, 128-bit symmetric encryption is subject to software export regulations in many countries, and this may cause certain problems if you develop software that creates or manages encrypted PDF documents.
As mentioned, the encryption key is derived in some way from the "secret key", or password. This means that no matter how long your password is, security will not be stronger than the length of the encryption key allows. 128-bit security, on the other hand, provides the necessary level of confidence, provided you take measures to counteract the attacks mentioned above.
When encryption is used, it is the data chunks that are encrypted. The document structure generally remains readable, since it is not encrypted. Also, by default the same encryption method is used for the whole document.
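The relationship between password quality and key length can be put into numbers. The following is a rough sketch, assuming the password is chosen uniformly at random from a fixed alphabet (human-chosen passwords are usually weaker than this estimate). The function names are illustrative and do not come from any PDF library.

```python
import math

def password_entropy_bits(length: int, alphabet_size: int) -> float:
    """Entropy in bits of a password picked uniformly at random."""
    return length * math.log2(alphabet_size)

def effective_strength_bits(length: int, alphabet_size: int, key_bits: int) -> float:
    """An attacker goes after whichever is weaker: the password or the key."""
    return min(password_entropy_bits(length, alphabet_size), float(key_bits))

# 8 random lowercase letters behind a 128-bit key: the password is the weak point.
weak_password = effective_strength_bits(8, 26, 128)
# 30 random printable ASCII characters behind a 40-bit key: the key is the weak point.
weak_key = effective_strength_bits(30, 95, 40)

print(f"{weak_password:.1f} bits, {weak_key:.1f} bits")  # 37.6 bits, 40.0 bits
```

In the first case a longer key would not help, and in the second case a longer password would not help.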
Asymmetric encryption is often called public-key encryption. It uses a pair of keys: a public key and a secret (private) key. Each intended user of the PDF document must have a pair of keys. The user gives the public key to the publisher and keeps the private key in a safe place. The publisher uses the public keys of the users to encrypt the document. Each user applies their own private key to decrypt the document and read it.
Consider an example in which a client opens a bank account. The client has to fill in an application form and send it to the bank. With symmetric encryption, the client would have to encrypt the application form with a password and then send both the form and the password to the bank. Both of these pieces of information can be intercepted, analyzed and used against the bank or the client. Also, the bank would have to manage a myriad of passwords from all of its clients. With asymmetric encryption, the bank maintains just one pair of keys, and each client uses the bank's public key to encrypt the application.
From a technical point of view, the key pair is used to encrypt or decrypt a symmetric encryption key, which in turn is used with the encryption algorithms mentioned above to encrypt the actual data. Since this encryption key is created randomly for each data encryption operation, it is not vulnerable to guessing (dictionary attacks) or to most brute-force attacks.
The scheme with asymmetric encryption is easier to use when a series of documents is distributed. If a user is to be excluded from the distribution, the document is simply not encrypted with that user's key. There is no need to distribute new passwords, as there is with symmetric encryption.
Unlike passwords, public and secret keys in asymmetric encryption are quite long (each key is 128 bytes or longer). This makes the keys harder to manage. To solve the problem of managing such long keys, X.509 certificates from PKI (public key infrastructure, a set of standards that define the creation, use and management of key pairs in asymmetric encryption) are employed. With an X.509 certificate, the public key is contained in the certificate structure. Besides the key itself, the certificate contains information about who created the keys and who is authorized to use them, the validity period of the keys, the intended use of the certificate and more. The private key is linked to the X.509 certificate.
Certificate management is a large topic, which is discussed briefly in other articles written by EldoS. As far as PDF security is concerned, the point is that certificates are a handy way to solve certain security problems. On the other hand, certificates involve the use of certificate authorities, which are there to guard against compromised certificates. Certificate validation requires checking each certificate with the authority to make sure that the certificate has not been revoked (cancelled) and that it has not been tampered with or forged. This is a purely technical step, but it exists and must be taken to implement PKI management properly.
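The two-level scheme described above (a random symmetric key, wrapped separately for each recipient) can be sketched in a few lines. This is a toy illustration only: it uses textbook RSA without padding and with key sizes far too small for real use, and the recipient names, helper functions and choice of primes are all assumptions made for this example. The actual document data would be encrypted with `content_key` using RC4 or AES, which is not shown. Python 3.8 or later is assumed for the modular inverse via `pow`.

```python
import secrets

E = 65537  # widely used public exponent

def make_toy_keypair(p: int, q: int):
    """Build a textbook RSA key pair from two primes (toy sizes, no padding)."""
    n = p * q
    d = pow(E, -1, (p - 1) * (q - 1))  # private exponent
    return (n, E), (n, d)              # (public key, private key)

def wrap_key(content_key: bytes, public_key) -> int:
    """Encrypt the symmetric key with a recipient's public key."""
    n, e = public_key
    return pow(int.from_bytes(content_key, "big"), e, n)

def unwrap_key(wrapped: int, private_key, length: int = 16) -> bytes:
    """Recover the symmetric key with the recipient's private key."""
    n, d = private_key
    return pow(wrapped, d, n).to_bytes(length, "big")

# Hypothetical recipients. Mersenne primes are used only because they are easy to write.
alice_pub, alice_priv = make_toy_keypair(2**61 - 1, 2**89 - 1)
bob_pub, bob_priv = make_toy_keypair(2**107 - 1, 2**127 - 1)

# A fresh random 128-bit key per document: nothing for a dictionary attack to guess.
content_key = secrets.token_bytes(16)
recipients = {
    "alice": wrap_key(content_key, alice_pub),
    "bob": wrap_key(content_key, bob_pub),
}

# Each recipient recovers the same content key with their own private key.
assert unwrap_key(recipients["alice"], alice_priv) == content_key
assert unwrap_key(recipients["bob"], bob_priv) == content_key

# Excluding bob from the next issue: generate a new key and do not wrap it for him.
next_key = secrets.token_bytes(16)
next_issue = {"alice": wrap_key(next_key, alice_pub)}
```

Note that nothing has to be sent to the remaining recipients when someone is dropped from the list.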
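The kinds of checks involved in certificate validation can be sketched as follows. This is a simplified model, not real X.509 handling: the `ToyCertificate` class and its fields are hypothetical, revocation is modelled as a plain set of serial numbers, and verification of the authority's signature on the certificate is left out entirely.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class ToyCertificate:
    # A hypothetical, heavily reduced stand-in for an X.509 certificate.
    serial: int
    subject: str
    issuer: str
    not_before: datetime
    not_after: datetime

def certificate_problems(cert, now, trusted_issuers, revoked_serials):
    """Return the reasons to reject the certificate (an empty list means acceptable)."""
    problems = []
    if cert.issuer not in trusted_issuers:
        problems.append("issuer is not trusted")
    if not (cert.not_before <= now <= cert.not_after):
        problems.append("outside validity period")
    if cert.serial in revoked_serials:
        problems.append("revoked by the authority")
    return problems

cert = ToyCertificate(
    serial=1001,
    subject="Example Publisher",
    issuer="Example CA",
    not_before=datetime(2024, 1, 1, tzinfo=timezone.utc),
    not_after=datetime(2025, 1, 1, tzinfo=timezone.utc),
)
now = datetime(2024, 6, 1, tzinfo=timezone.utc)

print(certificate_problems(cert, now, {"Example CA"}, set()))   # []
print(certificate_problems(cert, now, {"Example CA"}, {1001}))  # ['revoked by the authority']
```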
As with symmetric encryption, public-key based encryption encrypts the data chunks in the whole document.
Signing the documents
As already mentioned, signing a document is necessary to prove that the document was really created by the person who claims to be its author, and also to make sure that the document has not been altered in any way. For example, when a tax report is sent, the reporter must confirm that they are responsible for the report data, and there must also be some protection against attacks from third parties (who could alter the report and harm the reporter).
By its nature, signing of data involves PKI and key pairs. The signing process works as follows. The publisher calculates a hash (a special number derived from the document data) and uses their private key to encrypt the calculated hash. The encrypted hash and the public key are enclosed with the document. The user of the document calculates the hash too and compares it with the hash included in the document (the enclosed hash is first decrypted using the enclosed public key). The validity of the enclosed public key is checked as well.
As with asymmetric encryption, management of key pairs becomes much easier if X.509 certificates are employed. For this reason the PDF specification defines two signing schemes, and both of them use X.509 certificates.
A document signature is applied to the document as a whole. It is possible to exclude PDF forms from the signing process: each form can be excluded separately, either completely or only in certain fields. This lets the user fill in an already signed document without breaking the signature.
Sometimes it is important to know exactly when the document was issued. For example, if a tax report is sent, it is important to make sure that it was sent within the period defined by law. If the publisher (the reporter in our example) simply writes the time into the document, this is not proof, since the value can be incorrect. To guarantee a correct value, a third-party time-stamping service must be used. What does this service do? It takes the document (or the hash of the document data) and issues another data block, a time-stamp. This block can be verified later to confirm that the time-stamp is correct. In this scenario the time mark is actually set by a trusted third-party service.
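The hash-then-sign procedure can be illustrated with a short sketch. As a loud caveat, this uses textbook RSA with a toy key built from two Mersenne primes and no padding, so it shows the flow of the scheme rather than a secure implementation. The key values and function names are assumptions made for this example, and the hash is reduced modulo `N` only because the toy modulus is shorter than a SHA-256 digest. Python 3.8 or later is assumed.

```python
import hashlib

# Toy key pair (illustrative only, far too small for real use).
E = 65537
P, Q = 2**107 - 1, 2**127 - 1
N = P * Q                          # public modulus
D = pow(E, -1, (P - 1) * (Q - 1))  # private exponent, known to the publisher only

def digest_as_int(data: bytes) -> int:
    """Hash the document and map the hash into the range of the toy modulus."""
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % N

def sign(data: bytes) -> int:
    """Publisher: transform the hash with the private key."""
    return pow(digest_as_int(data), D, N)

def verify(data: bytes, signature: int) -> bool:
    """User: recover the hash with the public key and compare with a fresh hash."""
    return pow(signature, E, N) == digest_as_int(data)

document = b"Tax report: income 1000"
signature = sign(document)

print(verify(document, signature))                    # True
print(verify(b"Tax report: income 9000", signature))  # False, the data was altered
```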
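The time-stamping exchange can be sketched as below. This is a simplified model under stated assumptions: a real service signs the time-stamp with its private key so that anyone can verify it, whereas this sketch uses an HMAC with a secret held by the service as a stand-in, so only the service object itself can verify. The class and field names are hypothetical.

```python
import hashlib
import hmac
import secrets
import time

class ToyTimestampService:
    """Stand-in for a trusted third-party time-stamping service."""

    def __init__(self):
        self._secret = secrets.token_bytes(32)  # known to the service only

    def _seal(self, doc_hash: bytes, issued_at: int) -> bytes:
        # Bind the document hash and the time together under the service's secret.
        message = doc_hash + issued_at.to_bytes(8, "big")
        return hmac.new(self._secret, message, hashlib.sha256).digest()

    def issue(self, doc_hash: bytes) -> dict:
        # The service sees only the hash, never the document itself.
        issued_at = int(time.time())
        return {"hash": doc_hash, "time": issued_at,
                "seal": self._seal(doc_hash, issued_at)}

    def verify(self, token: dict, document: bytes) -> bool:
        doc_hash = hashlib.sha256(document).digest()
        expected = self._seal(token["hash"], token["time"])
        return doc_hash == token["hash"] and hmac.compare_digest(expected, token["seal"])

service = ToyTimestampService()
report = b"Tax report for the year"
token = service.issue(hashlib.sha256(report).digest())

# The reporter cannot backdate the token: changing the time invalidates the seal.
backdated = dict(token, time=token["time"] - 86400)
print(service.verify(token, report), service.verify(backdated, report))  # True False
```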
One important feature of the PDF specification is that it does not limit publishers and software developers to predefined encryption, signing and compression schemes. It lets them extend the functionality by introducing other schemes through so-called document handlers. For example, one may introduce biometric signing or PGP-based encryption by creating an add-in module for Acrobat and publishing the scheme as a specification (for developers of third-party software). So far, add-ins (handlers) have been created for PKI-based and biometric signing of documents.
If you introduce your own scheme for document processing, creating an Acrobat add-in is not strictly necessary. A security handler implemented in your own software might be enough for your needs.
If you decide to create an Acrobat security handler, you need to know that Acrobat supports signed and unsigned add-ins. Add-ins are usually signed by Adobe, which makes them trusted add-ins. If an add-in is not trusted, it can be blocked by Acrobat, so it makes sense to have the add-in signed by Adobe.
PDF security in your applications
When you create a software product that displays, creates or otherwise manages PDF documents, you will have to deal with PDF security. Most software components and libraries used for PDF creation and management support symmetric encryption of documents. With PKI-based encryption and signing, however, things are much worse: PKI security is not supported by the vast majority of PDF management solutions.
SecureBlackbox (with its PDFBlackbox package) provides support for both symmetric and certificate-based encryption and for certificate-based signing. SecureBlackbox is a collection of components for Windows and .NET development. You will find more about PDFBlackbox at http://www.eldos.com/sbb/desc-pdf.php