Manipulating PDF files with iTextSharp and VB.NET 2012

Introduction
PDF files
iTextSharp
Our Project
Code
LocationTextExtractionStrategy
Conclusion

Introduction

Recently, I had to write a VB.NET program that reads the contents of a PDF file and replaces some of it with customized text. VB.NET unfortunately doesn’t have a built-in PDF file reader object, so I had to make use of a third-party product called iTextSharp. From the moment I started using it, I fell in love with it. In this article I will demonstrate how to use iTextSharp with VB.NET to manipulate PDF files.

PDF files

A detailed explanation of PDF files can be found here.

iTextSharp

A detailed explanation and download of iTextSharp can be found here. As you can see, most of the iTextSharp material is aimed at C# and Java; hence this Visual Basic.NET article.

I would suggest that you go through the documentation properly before proceeding with our project. I cannot do everything for you; you need to put in some effort as well.

Our Project

Purpose

Our project’s aim is to read from a PDF file, change some of the contents and then add a watermark to the PDF document’s pages. Sounds easy enough? With the help of the iTextSharp library, you will see how simple it is.

Design

Our project doesn’t have much of a design. All we need is a progress bar and a button. Mine looks like Figure 1:

Figure 1 – Our Design

Code

Before we can jump in and code, you need to make sure that you have downloaded the iTextSharp libraries. Once that is done, we need to add a reference to it by clicking Project->Add Reference->iTextSharp.dll. Once we have the project reference set up, we need to reference the iTextSharp libraries in our code. Add the following Imports statements:

Imports System.IO 'Working With Files
Imports System.Text 'Working With Text

'iTextSharp Libraries
Imports iTextSharp.text 'Core PDF Text Functionalities
Imports iTextSharp.text.pdf 'PDF Content
Imports iTextSharp.text.pdf.parser 'Content Parser

This imports all the needed capabilities for our little program. Now the fun starts! Add the following Sub Procedure:

    Public Sub ReplacePDFText(ByVal strSearch As String, ByVal scCase As StringComparison, ByVal strSource As String, ByVal strDest As String)

        Dim psStamp As PdfStamper = Nothing 'PDF Stamper Object
        Dim pcbContent As PdfContentByte = Nothing 'Read PDF Content

        If File.Exists(strSource) Then 'Check If File Exists

            Dim pdfFileReader As New PdfReader(strSource) 'Read Our File

            psStamp = New PdfStamper(pdfFileReader, New FileStream(strDest, FileMode.Create)) 'Read Underlying Content of PDF File

            pbProgress.Value = 0 'Set Progressbar Minimum Value
            pbProgress.Maximum = pdfFileReader.NumberOfPages 'Set Progressbar Maximum Value

            For intCurrPage As Integer = 1 To pdfFileReader.NumberOfPages 'Loop Through All Pages

                Dim lteStrategy As LocTextExtractionStrategy = New LocTextExtractionStrategy 'Read PDF File Content Blocks

                pcbContent = psStamp.GetUnderContent(intCurrPage) 'Look At Current Block

                'Determine Spacing of Block To See If It Matches Our Search String
                lteStrategy.UndercontentCharacterSpacing = pcbContent.CharacterSpacing
                lteStrategy.UndercontentHorizontalScaling = pcbContent.HorizontalScaling

                'Trigger The Block Reading Process
                Dim currentText As String = PdfTextExtractor.GetTextFromPage(pdfFileReader, intCurrPage, lteStrategy)

                'Determine Match(es)
                Dim lstMatches As List(Of iTextSharp.text.Rectangle) = lteStrategy.GetTextLocations(strSearch, scCase)

                Dim pdLayer As PdfLayer 'Create New Layer
                pdLayer = New PdfLayer("Overwrite", psStamp.Writer) 'Enable Overwriting Capabilities

                'Set Fill Colour Of Replacing Layer
                pcbContent.SetColorFill(BaseColor.BLACK)

                For Each rctRect As Rectangle In lstMatches 'Loop Through Each Match

                    pcbContent.Rectangle(rctRect.Left, rctRect.Bottom, rctRect.Width, rctRect.Height) 'Create New Rectangle For Replacing Layer

                    pcbContent.Fill() 'Fill With Colour Specified

                    pcbContent.BeginLayer(pdLayer) 'Create Layer

                    pcbContent.SetColorFill(BaseColor.BLACK) 'Fill Layer

                    pcbContent.Fill() 'Fill Underlying Content

                    Dim pgState As PdfGState 'Create GState Object
                    pgState = New PdfGState()

                    pcbContent.SetGState(pgState) 'Set Current State

                    pcbContent.SetColorFill(BaseColor.WHITE) 'Fill Letters

                    pcbContent.BeginText() 'Start Text Replace Procedure

                    pcbContent.SetTextMatrix(rctRect.Left, rctRect.Bottom) 'Get Text Location

                    'Set New Font And Size
                    pcbContent.SetFontAndSize(BaseFont.CreateFont(BaseFont.HELVETICA, BaseFont.CP1252, BaseFont.NOT_EMBEDDED), 9)

                    pcbContent.ShowText("AMAZING!!!!") 'Replacing Text

                    pcbContent.EndText() 'Stop Text Replace Procedure

                    pcbContent.EndLayer() 'Stop Layer replace Procedure

                Next

                pbProgress.Value = pbProgress.Value + 1 'Increase Progressbar Value

            Next

            psStamp.Close() 'Close Stamp Object

            pdfFileReader.Close() 'Close File Once All Pages Have Been Processed

            'Add Watermark To The File We Have Just Written
            AddPDFWatermark(strDest, "C:\test_Watermarked_and_Replaced.pdf", Application.StartupPath & "\Anuba.jpg")

        End If

    End Sub

Oye! What a mouthful!

Before you freak out: this code is actually not so bad. Let’s have a look at it step by step:

We create a Stamper object and a Content object. The Stamper object enables us to write our content onto the PDF file. The Content object helps us to identify the appropriate content in the file that we need to replace.
We determine whether the PDF file exists, and read its underlying content. We also set up our ProgressBar to match the number of pages in the PDF document.
We commence our For loop (to loop through each page) and create a LocationTextExtractionStrategy object (named LocTextExtractionStrategy in the code). This object enables us to extract our desired text. This class also forms part of the iTextSharp download. We need to add this file to our project, but we’ll do that a bit later.
Once we know what text we need, and what dimensions the text uses, we can continue to loop through all the pages until a match is found. We store each match and create a new layer for each match to be replaced.
We then replace the found text with our new layer, which is filled in order to highlight our change. The trick here is to match the layer’s exact dimensions. A PDF file does not work like a Word document, where we could just find and replace text. Why? Because each little word or phrase is actually a block, or a layer; so, to replace that particular block, we need its exact dimensions. If we do not have the exact dimensions, the layered text will not appear in exactly the same place.
Lastly, we include a call to the AddPDFWatermark sub (which we will create now) to add a watermark on each page. The file that is written will be stored on the C:\ drive.

Make sense now?
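
Because the replacement has to fit the rectangle of the text it covers, it can help to derive the font size from that rectangle instead of hard-coding it. What follows is only a minimal sketch of that idea: FitFontSize and the ReplacementSizing module are hypothetical names that are not part of iTextSharp or of the project above. It relies on BaseFont.GetWidthPoint, which returns the width of a string in points at a given font size.

```vbnet
Imports System
Imports iTextSharp.text.pdf

Public Module ReplacementSizing

    'Hypothetical Helper: Largest Font Size (Capped At sngMaxSize) At Which
    'strText Is No Wider Than sngMaxWidth Points
    Public Function FitFontSize(ByVal bfFont As BaseFont, ByVal strText As String, _
                                ByVal sngMaxWidth As Single, ByVal sngMaxSize As Single) As Single

        'Width Of The Text At A Font Size Of 1 Point
        Dim sngUnitWidth As Single = bfFont.GetWidthPoint(strText, 1.0F)

        'Nothing To Measure (For Example An Empty String)
        If sngUnitWidth <= 0 Then Return sngMaxSize

        'Width Grows Linearly With Font Size, So Divide To Find The Size That Fits
        Return Math.Min(sngMaxSize, sngMaxWidth / sngUnitWidth)

    End Function

End Module
```

Inside the For Each loop, the font could be created once and the call might then look like pcbContent.SetFontAndSize(bfFont, FitFontSize(bfFont, "AMAZING!!!!", rctRect.Width, 9)), which keeps the 9-point size used above as the upper limit.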

Add the next Sub procedure:

    Public Shared Sub AddPDFWatermark(ByVal strSource As String, ByVal strDest As String, ByVal imgSource As String)

        Dim pdfFileReader As PdfReader = Nothing 'Read File
        Dim psStamp As PdfStamper = Nothing 'PDF Stamper Object
        Dim imgWaterMark As Image = Nothing 'Watermark Image

        Dim pcbContent As PdfContentByte = Nothing 'Read PDF Content
        Dim rctRect As Rectangle = Nothing 'Create New Rectangle To Host Image

        Dim sngX, sngY As Single 'Page Dimensions

        Dim intPageCount As Integer = 0 'Page Count

        Try
            pdfFileReader = New PdfReader(strSource) 'Read File

            rctRect = pdfFileReader.GetPageSizeWithRotation(1) 'Store Page Size

            psStamp = New PdfStamper(pdfFileReader, New System.IO.FileStream(strDest, IO.FileMode.Create)) 'Create new Stamper Object

            imgWaterMark = Image.GetInstance(imgSource) 'Get Image To Be Used For The Watermark

            If imgWaterMark.Width > rctRect.Width OrElse imgWaterMark.Height > rctRect.Height Then 'Make Sure Image Can Fit On Page

                imgWaterMark.ScaleToFit(rctRect.Width, rctRect.Height)
                sngX = (rctRect.Width - imgWaterMark.ScaledWidth) / 2
                sngY = (rctRect.Height - imgWaterMark.ScaledHeight) / 2

            Else 'Put In Center Of Page

                sngX = (rctRect.Width - imgWaterMark.Width) / 2
                sngY = (rctRect.Height - imgWaterMark.Height) / 2

            End If

            imgWaterMark.SetAbsolutePosition(sngX, sngY)

            intPageCount = pdfFileReader.NumberOfPages() 'Apply To All Pages

            For i As Integer = 1 To intPageCount
                pcbContent = psStamp.GetUnderContent(i)
                pcbContent.AddImage(imgWaterMark)
            Next

            psStamp.Close()
            pdfFileReader.Close()

        Catch ex As Exception

            Throw 'Something Went Wrong; Rethrow Without Resetting The Stack Trace

        End Try

    End Sub

This sub adds a watermark to each PDF page. You will notice that here, we do almost the same as we did in the previous sub. The only difference is that we add an image to the under content of each page, instead of replacing text layers.

The last piece of code we need to add for this form is the call to the ReplacePDFText sub from our start button:

    Private Sub Start_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Start.Click

        ReplacePDFText("just a simple test", _
                       StringComparison.CurrentCultureIgnoreCase, _
                       Application.StartupPath & "\test.pdf", _
                       "C:\test_words_replaced.pdf") 'Do Everything

    End Sub

This calls the sub to replace PDF content, and writes the new PDF file to a location on C:\. Now, we will have two files. Obviously, this is just an example and it would be easy to combine all of the changes into one file.
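
As a rough illustration of that last point, both changes can go through one PdfStamper so that only a single output file is written. This is a sketch rather than a tested drop-in replacement: ReplaceAndWatermark and the CombinedPass module are hypothetical names, the code assumes the LocTextExtractionStrategy class from the next section is in the project, and for brevity it only draws the black rectangle for each match, leaving out the layer and the replacement text shown earlier.

```vbnet
Imports System
Imports System.IO
Imports iTextSharp.text
Imports iTextSharp.text.pdf
Imports iTextSharp.text.pdf.parser
'Also Import The Namespace That Holds LocTextExtractionStrategy In Your Project

Public Module CombinedPass

    'Hypothetical Sub: Watermark And Black Out Matches In One Pass, Writing One File
    Public Sub ReplaceAndWatermark(ByVal strSearch As String, ByVal scCase As StringComparison, _
                                   ByVal strSource As String, ByVal strDest As String, _
                                   ByVal imgSource As String)

        Dim pdfFileReader As New PdfReader(strSource)
        Dim psStamp As New PdfStamper(pdfFileReader, New FileStream(strDest, FileMode.Create))
        Dim imgWaterMark As Image = Image.GetInstance(imgSource)

        For intCurrPage As Integer = 1 To pdfFileReader.NumberOfPages

            Dim pcbContent As PdfContentByte = psStamp.GetUnderContent(intCurrPage)
            Dim rctPage As Rectangle = pdfFileReader.GetPageSizeWithRotation(intCurrPage)

            'Watermark First, So The Rectangles Drawn Afterwards Sit On Top Of It
            If imgWaterMark.Width > rctPage.Width OrElse imgWaterMark.Height > rctPage.Height Then
                imgWaterMark.ScaleToFit(rctPage.Width, rctPage.Height)
            End If
            imgWaterMark.SetAbsolutePosition((rctPage.Width - imgWaterMark.ScaledWidth) / 2, _
                                             (rctPage.Height - imgWaterMark.ScaledHeight) / 2)
            pcbContent.AddImage(imgWaterMark)

            'Find The Matches On This Page, Exactly As ReplacePDFText Does
            Dim lteStrategy As New LocTextExtractionStrategy()
            lteStrategy.UndercontentCharacterSpacing = pcbContent.CharacterSpacing
            lteStrategy.UndercontentHorizontalScaling = pcbContent.HorizontalScaling
            PdfTextExtractor.GetTextFromPage(pdfFileReader, intCurrPage, lteStrategy)

            'Draw A Black Rectangle For Each Match (In The Under Content, As Before)
            pcbContent.SetColorFill(BaseColor.BLACK)
            For Each rctRect As Rectangle In lteStrategy.GetTextLocations(strSearch, scCase)
                pcbContent.Rectangle(rctRect.Left, rctRect.Bottom, rctRect.Width, rctRect.Height)
                pcbContent.Fill()
            Next

        Next

        psStamp.Close() 'Writes The Single Output File
        pdfFileReader.Close()

    End Sub

End Module
```

The watermark is added before the rectangles on purpose: content added later to the same under content is painted over what was added earlier.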

LocationTextExtractionStrategy

A full explanation can be found here.

This file forms part of the iTextSharp download I mentioned earlier. We need to add this file, as is, to our project. Remember, neither you nor I created this file or its logic. But without this file we will not be able to identify the content strings we are looking for. This demonstrates the real power of iTextSharp, and this is why iTextSharp is my preferred choice when it comes to doing any PDF manipulation. Because the class lives in the LocTextExtraction namespace, the form’s code also needs an Imports statement for that namespace (qualified with your project’s root namespace, if it has one).
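
To get a feel for what the class returns before wiring it into the form, the sketch below prints the rectangle of every match on one page. It is an illustration only: PrintMatchLocations and the MatchLister module are hypothetical names, and the spacing and scaling values are assumptions (no extra character spacing, and the PDF default horizontal scaling of 100) rather than values read from the page.

```vbnet
Imports System
Imports System.Collections.Generic
Imports iTextSharp.text
Imports iTextSharp.text.pdf
Imports iTextSharp.text.pdf.parser
'Also Import The Namespace That Holds LocTextExtractionStrategy In Your Project

Public Module MatchLister

    'Hypothetical Helper: Print Where strSearch Occurs On One Page
    Public Sub PrintMatchLocations(ByVal strSource As String, ByVal intPage As Integer, _
                                   ByVal strSearch As String)

        Dim pdfFileReader As New PdfReader(strSource)

        Try
            Dim lteStrategy As New LocTextExtractionStrategy()
            lteStrategy.UndercontentCharacterSpacing = 0    'Assumption: No Extra Character Spacing
            lteStrategy.UndercontentHorizontalScaling = 100 'Assumption: PDF Default Scaling

            'Parsing The Page Feeds Every Text Chunk To The Strategy
            PdfTextExtractor.GetTextFromPage(pdfFileReader, intPage, lteStrategy)

            Dim lstMatches As List(Of Rectangle) = _
                lteStrategy.GetTextLocations(strSearch, StringComparison.CurrentCultureIgnoreCase)

            For Each rctRect As Rectangle In lstMatches
                Console.WriteLine("Left={0} Bottom={1} Width={2} Height={3}", _
                                  rctRect.Left, rctRect.Bottom, rctRect.Width, rctRect.Height)
            Next
        Finally
            pdfFileReader.Close() 'Always Release The File
        End Try

    End Sub

End Module
```

The coordinates are in PDF user space units, measured from the bottom-left corner of the page, which is why ReplacePDFText can pass them straight to pcbContent.Rectangle.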

Add a new class and add the following to it (in case you didn’t download the iTextSharp files at the location I’ve mentioned):

Imports System
Imports System.Collections.Generic
Imports System.Text
Imports iTextSharp.text.pdf
Imports iTextSharp.text.pdf.parser

''
'' * $Id$
'' *
'' * This file is part of the iText project.
'' * Copyright (c) 1998-2009 1T3XT BVBA
'' * Authors: Kevin Day, Bruno Lowagie, Paulo Soares, et al.
'' *
'' * This program is free software; you can redistribute it and/or modify
'' * it under the terms of the GNU Affero General Public License version 3
'' * as published by the Free Software Foundation with the addition of the
'' * following permission added to Section 15 as permitted in Section 7(a):
'' * FOR ANY PART OF THE COVERED WORK IN WHICH THE COPYRIGHT IS OWNED BY 1T3XT,
'' * 1T3XT DISCLAIMS THE WARRANTY OF NON INFRINGEMENT OF THIRD PARTY RIGHTS.
'' *
'' * This program is distributed in the hope that it will be useful, but
'' * WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
'' * or FITNESS FOR A PARTICULAR PURPOSE.
'' * See the GNU Affero General Public License for more details.
'' * You should have received a copy of the GNU Affero General Public License
'' * along with this program; if not, see http://www.gnu.org/licenses or write to
'' * the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor,
'' * Boston, MA, 02110-1301 USA, or download the license from the following URL:
'' * http://itextpdf.com/terms-of-use/
'' *
'' * The interactive user interfaces in modified source and object code versions
'' * of this program must display Appropriate Legal Notices, as required under
'' * Section 5 of the GNU Affero General Public License.
'' *
'' * In accordance with Section 7(b) of the GNU Affero General Public License,
'' * you must retain the producer line in every PDF that is created or manipulated
'' * using iText.
'' *
'' * You can be released from the requirements of the license by purchasing
'' * a commercial license. Buying such a license is mandatory as soon as you
'' * develop commercial activities involving the iText software without
'' * disclosing the source code of your own applications.
'' * These activities include: offering paid services to customers as an ASP,
'' * serving PDFs on the fly in a web application, shipping iText with a closed
'' * source product.
'' *
'' * For more information, please contact iText Software Corp. at this
'' * address: sales@itextpdf.com
''

''*
''     * Development preview - this class (and all of the parser classes) are still experiencing
''     * heavy development, and are subject to change both behavior and interface.
''     * 
''     * A text extraction renderer that keeps track of relative position of text on page
''     * The resultant text will be relatively consistent with the physical layout that most
''     * PDF files have on screen.
''     * 
''     * This renderer keeps track of the orientation and distance (both perpendicular
''     * and parallel) to the unit vector of the orientation.  Text is ordered by
''     * orientation, then perpendicular, then parallel distance.  Text with the same
''     * perpendicular distance, but different parallel distance is treated as being on
''     * the same line.
''     * 
''     * This renderer also uses a simple strategy based on the font metrics to determine if
''     * a blank space should be inserted into the output.
''     *
''     * @since   5.0.2
''
Namespace LocTextExtraction

    Public Class LocTextExtractionStrategy
        Implements ITextExtractionStrategy

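        'Character spacing and horizontal scaling of the page's under content (set by the caller)
        Private _UndercontentCharacterSpacing As Single = 0
        Private _UndercontentHorizontalScaling As Single = 100 'PDF default; 0 would make every measured width 0
        Private ThisPdfDocFonts As SortedList(Of String, DocumentFont)

        '* set to true for debugging

        Public Shared DUMP_STATE As Boolean = False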

        '* a summary of all found text

        Private locationalResult As New List(Of TextChunk)()

        '*
        '         * Creates a new text extraction renderer.
        '

        Public Sub New()
            ThisPdfDocFonts = New SortedList(Of String, DocumentFont)
        End Sub

        '*
        '         * @see com.itextpdf.text.pdf.parser.RenderListener#beginTextBlock()
        '

        Public Overridable Sub BeginTextBlock() Implements ITextExtractionStrategy.BeginTextBlock
        End Sub

        '*
        '         * @see com.itextpdf.text.pdf.parser.RenderListener#endTextBlock()
        '

        Public Overridable Sub EndTextBlock() Implements ITextExtractionStrategy.EndTextBlock
        End Sub

        '*
        '         * @param str
        '         * @return true if the string starts with a space character, false if the string is empty or starts with a non-space character
        '

        Private Function StartsWithSpace(ByVal str As [String]) As Boolean
            If str.Length = 0 Then
                Return False
            End If
            Return str(0) = " "c
        End Function

        '*
        '         * @param str
        '         * @return true if the string ends with a space character, false if the string is empty or ends with a non-space character
        '

        Private Function EndsWithSpace(ByVal str As [String]) As Boolean
            If str.Length = 0 Then
                Return False
            End If
            Return str(str.Length - 1) = " "c
        End Function

        Public Property UndercontentCharacterSpacing As Single
            Get
                Return _UndercontentCharacterSpacing
            End Get
            Set(ByVal value As Single)
                _UndercontentCharacterSpacing = value
            End Set
        End Property

        Public Property UndercontentHorizontalScaling As Single
            Get
                Return _UndercontentHorizontalScaling
            End Get
            Set(ByVal value As Single)
                _UndercontentHorizontalScaling = value
            End Set
        End Property

        Public Overridable Function GetResultantText() As [String] Implements ITextExtractionStrategy.GetResultantText

            If DUMP_STATE Then
                DumpState()
            End If

            locationalResult.Sort()

            Dim sb As New StringBuilder()
            Dim lastChunk As TextChunk = Nothing

            For Each chunk As TextChunk In locationalResult

                If lastChunk Is Nothing Then
                    sb.Append(chunk.text)
                Else
                    If chunk.SameLine(lastChunk) Then
                        Dim dist As Single = chunk.DistanceFromEndOf(lastChunk)
                        If dist < -chunk.charSpaceWidth Then
                            sb.Append(" "c)
                            ' we only insert a blank space if the trailing character of the previous string wasn't a space, and the leading character of the current string isn't a space
                        ElseIf dist > chunk.charSpaceWidth / 2.0F AndAlso Not StartsWithSpace(chunk.text) AndAlso Not EndsWithSpace(lastChunk.text) Then
                            sb.Append(" "c)
                        End If

                        sb.Append(chunk.text)
                    Else
                        sb.Append(ControlChars.Lf)
                        sb.Append(chunk.text)
                    End If
                End If
                lastChunk = chunk
            Next

            Return sb.ToString()

        End Function

        Public Function GetTextLocations(ByVal pSearchString As String, ByVal pStrComp As System.StringComparison) As List(Of iTextSharp.text.Rectangle)
            Dim FoundMatches As New List(Of iTextSharp.text.Rectangle)
            Dim sb As New StringBuilder()
            Dim ThisLineChunks As List(Of TextChunk) = New List(Of TextChunk)
            Dim bStart As Boolean, bEnd As Boolean
            Dim FirstChunk As TextChunk = Nothing, LastChunk As TextChunk = Nothing
            Dim sTextInUsedChunks As String = vbNullString

            For Each chunk As TextChunk In locationalResult

                If ThisLineChunks.Count > 0 AndAlso Not chunk.SameLine(ThisLineChunks.Last) Then
                    If sb.ToString.IndexOf(pSearchString, pStrComp) > -1 Then
                        Dim sLine As String = sb.ToString

                        'Check how many times the Search String is present in this line:
                        Dim iCount As Integer = 0
                        Dim lPos As Integer
                        lPos = sLine.IndexOf(pSearchString, 0, pStrComp)
                        Do While lPos > -1
                            iCount += 1
                            If lPos + pSearchString.Length > sLine.Length Then Exit Do Else lPos = lPos + pSearchString.Length
                            lPos = sLine.IndexOf(pSearchString, lPos, pStrComp)
                        Loop

                        'Process each match found in this Text line:
                        Dim curPos As Integer = 0
                        For i As Integer = 1 To iCount
                            Dim sCurrentText As String, iFromChar As Integer, iToChar As Integer

                            iFromChar = sLine.IndexOf(pSearchString, curPos, pStrComp)
                            curPos = iFromChar
                            iToChar = iFromChar + pSearchString.Length - 1
                            sCurrentText = vbNullString
                            sTextInUsedChunks = vbNullString
                            FirstChunk = Nothing
                            LastChunk = Nothing

                            'Get first and last Chunks corresponding to this match found, from all Chunks in this line
                            For Each chk As TextChunk In ThisLineChunks
                                sCurrentText = sCurrentText & chk.text

                                'Check if we entered the part where we had found a matching String then get this Chunk (First Chunk)
                                If Not bStart AndAlso sCurrentText.Length - 1 >= iFromChar Then
                                    FirstChunk = chk
                                    bStart = True
                                End If

                                'Keep getting Text from Chunks while we are in the part where the matching String had been found
                                If bStart AndAlso Not bEnd Then
                                    sTextInUsedChunks = sTextInUsedChunks & chk.text
                                End If

                                'If we get out the matching String part then get this Chunk (last Chunk)
                                If Not bEnd AndAlso sCurrentText.Length - 1 >= iToChar Then
                                    LastChunk = chk
                                    bEnd = True
                                End If

                                'If we already have first and last Chunks enclosing the Text where our String pSearchString has been found
                                'then it's time to get the rectangle, GetRectangleFromText Function below this Function, there we extract the pSearchString locations
                                If bStart AndAlso bEnd Then
                                    FoundMatches.Add(GetRectangleFromText(FirstChunk, LastChunk, pSearchString, sTextInUsedChunks, iFromChar, iToChar, pStrComp))
                                    curPos = curPos + pSearchString.Length
                                    bStart = False : bEnd = False
                                    Exit For
                                End If
                            Next
                        Next
                    End If
                    sb.Clear()
                    ThisLineChunks.Clear()
                End If
                ThisLineChunks.Add(chunk)
                sb.Append(chunk.text)
            Next

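            'Note: a line is only searched once a chunk from the following line arrives,
            'so the chunks of the final line are collected above but never searched.
            Return FoundMatches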
        End Function

        Private Function GetRectangleFromText(ByVal FirstChunk As TextChunk, ByVal LastChunk As TextChunk, ByVal pSearchString As String, _
                                   ByVal sTextinChunks As String, ByVal iFromChar As Integer, ByVal iToChar As Integer, ByVal pStrComp As System.StringComparison) As iTextSharp.text.Rectangle

            'There are cases where a Chunk contains extra text at the beginning and end. We don't want the location of this extra text; we need the location of pSearchString inside it.
            'For these cases we need to crop the String (left and right) and measure the excess at left and right. At this point we don't have any direct way to make a
            'transformation from Text Space points to User Space units, because the matrix for this transformation is not accessible from here. So, for these special cases where
            'the String needs to be cropped (left/right), we interpolate: we have the width of the Text in the Chunk in User Space units, and we measure the same
            'String in Text Space units. The ratio between these 2 values gives the TransformationValue we need for all cases.

            'Text Width in User Space Units
            Dim LineRealWidth As Single = LastChunk.PosRight - FirstChunk.PosLeft

            'Text Width in Text Units
            Dim LineTextWidth As Single = GetStringWidth(sTextinChunks, LastChunk.curFontSize, _
                                                         LastChunk.charSpaceWidth, _
                                                         ThisPdfDocFonts.Values.ElementAt(LastChunk.FontIndex))
            'TransformationValue value for Interpolation
            Dim TransformationValue As Single = LineRealWidth / LineTextWidth

            'In the worst case, we'll need to crop left and right:
            Dim iStart As Integer = sTextinChunks.IndexOf(pSearchString, pStrComp)

            Dim iEnd As Integer = iStart + pSearchString.Length - 1

            Dim sLeft As String
            If iStart = 0 Then sLeft = vbNullString Else sLeft = sTextinChunks.Substring(0, iStart)

            Dim sRight As String
            If iEnd = sTextinChunks.Length - 1 Then sRight = vbNullString Else sRight = sTextinChunks.Substring(iEnd + 1, sTextinChunks.Length - iEnd - 1)

            'Measure cropped Text at left:
            Dim LeftWidth As Single = 0
            If iStart > 0 Then
                LeftWidth = GetStringWidth(sLeft, LastChunk.curFontSize, _
                                                  LastChunk.charSpaceWidth, _
                                                  ThisPdfDocFonts.Values.ElementAt(LastChunk.FontIndex))
                LeftWidth = LeftWidth * TransformationValue
            End If

            'Measure cropped Text at right:
            Dim RightWidth As Single = 0
            If iEnd < sTextinChunks.Length - 1 Then
                RightWidth = GetStringWidth(sRight, LastChunk.curFontSize, _
                                                    LastChunk.charSpaceWidth, _
                                                    ThisPdfDocFonts.Values.ElementAt(LastChunk.FontIndex))
                RightWidth = RightWidth * TransformationValue
            End If

            'LeftWidth is the text width at left we need to exclude, FirstChunk.distParallelStart is the distance to left margin, both together will give us this LeftOffset
            Dim LeftOffset As Single = FirstChunk.distParallelStart + LeftWidth
            'RightWidth is the text width at right we need to exclude, LastChunk.distParallelEnd is the distance to right margin, we subtract RightWidth from distParallelEnd to get RightOffset
            Dim RightOffset As Single = LastChunk.distParallelEnd - RightWidth
            'Return this Rectangle
            Return New iTextSharp.text.Rectangle(LeftOffset, FirstChunk.PosBottom, RightOffset, FirstChunk.PosTop)

        End Function

        Private Function GetStringWidth(ByVal str As String, ByVal curFontSize As Single, ByVal pSingleSpaceWidth As Single, ByVal pFont As DocumentFont) As Single
            Dim chars() As Char = str.ToCharArray()
            Dim totalWidth As Single = 0
            Dim w As Single = 0

            For Each c As Char In chars
                w = pFont.GetWidth(c) / 1000
                totalWidth += (w * curFontSize + Me.UndercontentCharacterSpacing) * Me.UndercontentHorizontalScaling / 100
            Next

            Return totalWidth
        End Function

        Private Sub DumpState()
            For Each location As TextChunk In locationalResult
                location.PrintDiagnostics()
                Console.WriteLine()
            Next
        End Sub

        Public Overridable Sub RenderText(ByVal renderInfo As TextRenderInfo) Implements ITextExtractionStrategy.RenderText
            Dim segment As LineSegment = renderInfo.GetBaseline()
            Dim location As New TextChunk(renderInfo.GetText(), segment.GetStartPoint(), segment.GetEndPoint(), renderInfo.GetSingleSpaceWidth())

            With location

                'Chunk Location:
                Debug.Print(renderInfo.GetText)
                .PosLeft = renderInfo.GetDescentLine.GetStartPoint(Vector.I1)
                .PosRight = renderInfo.GetAscentLine.GetEndPoint(Vector.I1)
                .PosBottom = renderInfo.GetDescentLine.GetStartPoint(Vector.I2)
                .PosTop = renderInfo.GetAscentLine.GetEndPoint(Vector.I2)
                'Chunk Font Size: (Height)
                .curFontSize = .PosTop - segment.GetStartPoint()(Vector.I2)
                'Use Font name  and Size as Key in the SortedList
                Dim StrKey As String = renderInfo.GetFont.PostscriptFontName & .curFontSize.ToString
                'Add this font to ThisPdfDocFonts SortedList if it's not already present
                If Not ThisPdfDocFonts.ContainsKey(StrKey) Then ThisPdfDocFonts.Add(StrKey, renderInfo.GetFont)
                'Store the SortedList index in this Chunk, so we can get it later
                '(Note: this index can go stale if a font added later sorts before this key)
                .FontIndex = ThisPdfDocFonts.IndexOfKey(StrKey)
            End With
            locationalResult.Add(location)
        End Sub

        '*
        '         * Represents a chunk of text, its orientation, and location relative to the orientation vector
        '

        Public Class TextChunk
            Implements IComparable(Of TextChunk)
            '* the text of the chunk

            Friend text As [String]
            '* the starting location of the chunk

            Friend startLocation As Vector
            '* the ending location of the chunk

            Friend endLocation As Vector
            '* unit vector in the orientation of the chunk

            Friend orientationVector As Vector
            '* the orientation as a scalar for quick sorting

            Friend orientationMagnitude As Integer
            '* perpendicular distance to the orientation unit vector (i.e. the Y position in an unrotated coordinate system)
            '             * we round to the nearest integer to handle the fuzziness of comparing floats

            Friend distPerpendicular As Integer
            '* distance of the start of the chunk parallel to the orientation unit vector (i.e. the X position in an unrotated coordinate system)

            Friend distParallelStart As Single
            '* distance of the end of the chunk parallel to the orientation unit vector (i.e. the X position in an unrotated coordinate system)

            Friend distParallelEnd As Single
            '* the width of a single space character in the font of the chunk

            Friend charSpaceWidth As Single

            ' Position (left, right, top, bottom), font size and font index of the chunk.
            ' Auto-implemented properties behave exactly like the usual field-plus-Get/Set pattern.
            Public Property FontIndex As Integer
            Public Property PosLeft As Single
            Public Property PosRight As Single
            Public Property PosTop As Single
            Public Property PosBottom As Single
            Public Property curFontSize As Single

            Public Sub New(ByVal str As String, ByVal startLocation As Vector, ByVal endLocation As Vector, ByVal charSpaceWidth As Single)
                Me.text = str
                Me.startLocation = startLocation
                Me.endLocation = endLocation
                Me.charSpaceWidth = charSpaceWidth

                ' Direction of the text baseline; fall back to the X axis for zero-length chunks
                Dim oVector As Vector = endLocation.Subtract(startLocation)
                If oVector.Length = 0 Then
                    oVector = New Vector(1, 0, 0)
                End If
                orientationVector = oVector.Normalize()
                ' Baseline angle in thousandths of a radian, stored as an Integer so it can be compared exactly
                orientationMagnitude = CInt(Math.Truncate(Math.Atan2(orientationVector(Vector.I2), orientationVector(Vector.I1)) * 1000))

                ' Signed distance of the baseline from the origin, measured perpendicular to the orientation
                Dim origin As New Vector(0, 0, 1)
                distPerpendicular = CInt((startLocation.Subtract(origin)).Cross(orientationVector)(Vector.I3))

                ' Start and end positions measured along the orientation vector
                distParallelStart = orientationVector.Dot(startLocation)
                distParallelEnd = orientationVector.Dot(endLocation)
            End Sub

            Public Sub PrintDiagnostics()
                Console.WriteLine("Text (@" & Convert.ToString(startLocation) & " -> " & Convert.ToString(endLocation) & "): " & text)
                Console.WriteLine("orientationMagnitude: " & orientationMagnitude)
                Console.WriteLine("distPerpendicular: " & distPerpendicular)
                Console.WriteLine("distParallel: " & distParallelStart)
            End Sub

            ' Checks whether two chunks share a line, i.e. have the same orientation
            ' and the same perpendicular distance.
            ' a: the chunk to compare to
            ' Returns True if this chunk is on the same line as the other
            Public Function SameLine(ByVal a As TextChunk) As Boolean
                If orientationMagnitude <> a.orientationMagnitude Then
                    Return False
                End If
                If distPerpendicular <> a.distPerpendicular Then
                    Return False
                End If
                Return True
            End Function

            ' Computes the distance between the end of 'other' and the beginning of this chunk
            ' in the direction of this chunk's orientation vector. Note that it's a bad idea
            ' to call this for chunks that aren't on the same line and orientation, but we don't
            ' explicitly check for that condition for performance reasons.
            ' other: the chunk that precedes this one
            ' Returns the distance (not a number of spaces) between the end of 'other' and the beginning of this chunk
            Public Function DistanceFromEndOf(ByVal other As TextChunk) As Single
                Dim distance As Single = distParallelStart - other.distParallelEnd
                Return distance
            End Function

            ' Compares based on orientation, perpendicular distance, then parallel distance.
            ' Implements System.IComparable(Of TextChunk).CompareTo so that a list of chunks can be sorted.
            Public Function CompareTo(ByVal rhs As TextChunk) As Integer Implements System.IComparable(Of TextChunk).CompareTo
                ' not really needed, but just in case
                If Me Is rhs Then
                    Return 0
                End If

                Dim rslt As Integer
                rslt = CompareInts(orientationMagnitude, rhs.orientationMagnitude)
                If rslt <> 0 Then
                    Return rslt
                End If

                rslt = CompareInts(distPerpendicular, rhs.distPerpendicular)
                If rslt <> 0 Then
                    Return rslt
                End If

                ' note: it's never safe to check floating point numbers for equality, and if two chunks
                ' are truly right on top of each other, which one comes first or second just doesn't matter
                ' so we arbitrarily choose this way.
                rslt = If(distParallelStart < rhs.distParallelStart, -1, 1)

                Return rslt
            End Function

            ' Compares two integers.
            ' Returns 0 if they are equal, -1 if int1 is smaller, and 1 if int1 is larger
            Private Shared Function CompareInts(ByVal int1 As Integer, ByVal int2 As Integer) As Integer
                Return If(int1 = int2, 0, If(int1 < int2, -1, 1))
            End Function


        End Class

        ' No-op method: this renderer isn't interested in image events.
        ' See IRenderListener.RenderImage(ImageRenderInfo). Since 5.0.1.
        Public Sub RenderImage(ByVal renderInfo As ImageRenderInfo) Implements IRenderListener.RenderImage
            ' do nothing
        End Sub
    End Class
End Namespace
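
To make the sorting logic in TextChunk a little more concrete, here is a minimal, self-contained sketch of the same arithmetic. It is not part of the project: SimpleChunk, ChunkOrderingSketch and OrderedText are names I made up for illustration, and plain Single coordinates stand in for the iTextSharp Vector class so that the sketch runs without the library. One simplification to be aware of: ties on the parallel distance return 0 here, whereas the CompareTo above always returns -1 or 1 at that point.

```vbnet
Imports System
Imports System.Collections.Generic

' Hypothetical stand-in for TextChunk: plain coordinates instead of iTextSharp Vectors
Public Class SimpleChunk
    Implements IComparable(Of SimpleChunk)

    Public ReadOnly Text As String
    Public ReadOnly OrientationMagnitude As Integer
    Public ReadOnly DistPerpendicular As Integer
    Public ReadOnly DistParallelStart As Single
    Public ReadOnly DistParallelEnd As Single

    Public Sub New(ByVal chunkText As String, ByVal x1 As Single, ByVal y1 As Single,
                   ByVal x2 As Single, ByVal y2 As Single)
        Text = chunkText

        ' Unit vector along the baseline; fall back to (1, 0) for zero-length chunks
        Dim dx As Double = x2 - x1
        Dim dy As Double = y2 - y1
        Dim length As Double = Math.Sqrt(dx * dx + dy * dy)
        Dim ox As Double = 1
        Dim oy As Double = 0
        If length > 0 Then
            ox = dx / length
            oy = dy / length
        End If

        ' Angle of the baseline in thousandths of a radian
        OrientationMagnitude = CInt(Math.Truncate(Math.Atan2(oy, ox) * 1000))
        ' Z component of (start x orientation); this equals -y1 for horizontal text
        DistPerpendicular = CInt(x1 * oy - y1 * ox)
        ' Projection of the start and end points onto the baseline direction
        DistParallelStart = CSng(ox * x1 + oy * y1)
        DistParallelEnd = CSng(ox * x2 + oy * y2)
    End Sub

    ' Same orientation and same perpendicular distance means same line
    Public Function SameLine(ByVal other As SimpleChunk) As Boolean
        Return OrientationMagnitude = other.OrientationMagnitude AndAlso
               DistPerpendicular = other.DistPerpendicular
    End Function

    ' Gap between the end of 'other' and the start of this chunk, along the baseline
    Public Function DistanceFromEndOf(ByVal other As SimpleChunk) As Single
        Return DistParallelStart - other.DistParallelEnd
    End Function

    ' Orientation first, then perpendicular distance, then parallel distance
    Public Function CompareTo(ByVal other As SimpleChunk) As Integer Implements IComparable(Of SimpleChunk).CompareTo
        Dim result As Integer = OrientationMagnitude.CompareTo(other.OrientationMagnitude)
        If result <> 0 Then Return result
        result = DistPerpendicular.CompareTo(other.DistPerpendicular)
        If result <> 0 Then Return result
        Return DistParallelStart.CompareTo(other.DistParallelStart)
    End Function
End Class

Module ChunkOrderingSketch

    ' Sorts the chunks in place and joins their text with "|" so the order is easy to see
    Function OrderedText(ByVal chunks As List(Of SimpleChunk)) As String
        chunks.Sort()
        Dim parts As New List(Of String)
        For Each chunk As SimpleChunk In chunks
            parts.Add(chunk.Text)
        Next
        Return String.Join("|", parts)
    End Function

    Sub Main()
        Dim hello As New SimpleChunk("Hello", 50, 700, 90, 700)
        Dim world As New SimpleChunk("world", 100, 700, 140, 700)
        Dim nextLine As New SimpleChunk("Second line", 50, 680, 130, 680)

        ' Deliberately added out of reading order
        Dim chunks As New List(Of SimpleChunk) From {world, nextLine, hello}

        Console.WriteLine(OrderedText(chunks))             ' Hello|world|Second line
        Console.WriteLine(hello.SameLine(world))           ' True
        Console.WriteLine(hello.SameLine(nextLine))        ' False
        Console.WriteLine(world.DistanceFromEndOf(hello))  ' 10
    End Sub
End Module
```

PDF coordinates start at the bottom left of the page, so a line higher up has a larger Y value. For horizontal text the perpendicular distance works out to the negated Y value, which is why sorting in ascending order gives the lines from top to bottom.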

All we need to do now is to import this namespace into our form. Add the following Imports statement to your form’s code (PDF_Play is the name of my project; if yours is named differently, substitute its root namespace):

Imports PDF_Play.LocTextExtraction 'Import LocationTextExtractionStrategy Capabilities

If we run our project now, it will work as intended.

I am including my project below for you to download. Unfortunately, iTextSharp.dll is too big to include here, so you will need to download it yourself by following the steps I outlined earlier.

Conclusion

Thank you for reading my article. Obviously, I am only human (don’t be so surprised!), and I can only do so much; but I couldn’t have written this article if it wasn’t for some help I received from a gentleman called jcis. Thank you – sometimes I bite off more than I can chew…

I hope you have enjoyed this article, and actually learned a thing or two from it. Now I’m off to see what new projects I can do and why VB.NET always seems to be second choice and C# first choice for real hardcore complicated projects…