Introduction
Recently, I had to make a VB.NET program that reads PDF file contents and replace it with customized text. VB.NET unfortunately doesn’t have a built in PDF file reader object, so I had to make use of a third party’s product called iTextSharp. From the moment I started using it, I fell in love with it. With this article I will demonstrate how to use iTextSharp with VB.NET to manipulate PDF files.
PDF files
A detailed explanation of PDF files can be found here.
iTextSharp
A detailed explanation, and download of iTextSharp can be found here. As you can see, iTextSharp is mostly for C# and Java; hence this Visual Basic.NET article.
I would suggest that you go through the documentation properly before proceeding with our project. I cannot do everything for you, you need to have some input as well.
Our Project
Purpose
Our project’s aim is to read from a PDF file, change some of the contents and then add a watermark to the PDF document’s pages. Sound easy enough, yes, with the help of the iTextSharp library you will see how simple it is.
Design
Our project doesn’t have much of a design. All we need is a progress bar and a button. Mine looks like Figure 1 :

Figure 1 – Our Design
Code
Before we can jump in and code, you need to make sure that you have downloaded the iTextSharp libraries. Once that is done, we need to add a reference to it by clicking Project->Add Reference->iTextSharp.dll. Once we have the project reference set up, we need to reference the iTextSharp libraries in our code. Add the following Imports statements:
Imports System.IO
Imports System.Text
Imports iTextSharp.text
Imports iTextSharp.text.pdf
Imports iTextSharp.text.pdf.parser
This imports all the needed capabilities for our little program. Now the fun starts! Add the following Sub Procedure:
Public Sub ReplacePDFText(ByVal strSearch As String, ByVal scCase As StringComparison, ByVal strSource As String, ByVal strDest As String)
Dim psStamp As PdfStamper = Nothing
Dim pcbContent As PdfContentByte = Nothing
If File.Exists(strSource) Then
Dim pdfFileReader As New PdfReader(strSource)
psStamp = New PdfStamper(pdfFileReader, New FileStream(strDest, FileMode.Create))
pbProgress.Value = 0
pbProgress.Maximum = pdfFileReader.NumberOfPages
For intCurrPage As Integer = 1 To pdfFileReader.NumberOfPages
Dim lteStrategy As LocTextExtractionStrategy = New LocTextExtractionStrategy
pcbContent = psStamp.GetUnderContent(intCurrPage)
lteStrategy.UndercontentCharacterSpacing = pcbContent.CharacterSpacing
lteStrategy.UndercontentHorizontalScaling = pcbContent.HorizontalScaling
Dim currentText As String = PdfTextExtractor.GetTextFromPage(pdfFileReader, intCurrPage, lteStrategy)
Dim lstMatches As List(Of iTextSharp.text.Rectangle) = lteStrategy.GetTextLocations(strSearch, scCase)
Dim pdLayer As PdfLayer
pdLayer = New PdfLayer("Overrite", psStamp.Writer)
pcbContent.SetColorFill(BaseColor.BLACK)
For Each rctRect As Rectangle In lstMatches
pcbContent.Rectangle(rctRect.Left, rctRect.Bottom, rctRect.Width, rctRect.Height)
pcbContent.Fill()
pcbContent.BeginLayer(pdLayer)
pcbContent.SetColorFill(BaseColor.BLACK)
pcbContent.Fill()
Dim pgState As PdfGState
pgState = New PdfGState()
pcbContent.SetGState(pgState)
pcbContent.SetColorFill(BaseColor.WHITE)
pcbContent.BeginText()
pcbContent.SetTextMatrix(rctRect.Left, rctRect.Bottom)
pcbContent.SetFontAndSize(BaseFont.CreateFont(BaseFont.HELVETICA, BaseFont.CP1252, BaseFont.NOT_EMBEDDED), 9)
pcbContent.ShowText("AMAZING!!!!")
pcbContent.EndText()
pcbContent.EndLayer()
Next
pbProgress.Value = pbProgress.Value + 1
pdfFileReader.Close()
Next
psStamp.Close()
End If
AddPDFWatermark("C:\test_words_replaced.pdf", "C:\test_Watermarked_and_Replaced.pdf", Application.StartupPath & "\Anuba.jpg")
End Sub
Oye! What a mouthful!
Before you freak out; this code is actually not so bad. Let’s have a look at it step by step:
- We create a Stamper object and a Content object. The Stamper object is to enable us to write our content onto the PDF file. The content object helps us to identify the appropriate content on the file that we need to replace.
- We determine if the PDF file exists, and read its underlying content. We also set up our ProgressBar to compensate for the amount of pages in the PDF document.
- We commence our For Loop (to loop through each page) and create a LocationTextExtractionStrategy object. This object enables us to extract our desired text. This class also forms part of the iTextSharp download. We need to add this file to our project – but we’ll do that a bit later.
- Once we know what text we need, and what diameters the text use, we could continue to loop through all the pages until a match is found. We store each match and create a new layer for each match to be replaced.
- We then replace the found text with our new layer that is filled in order to highlight our change. The trick here is to replace the layer’s exact dimensions. A PDF file does not work similar to a Word document where we could just find and replace text. Why? Because each little word or phrase is actually a block, or a layer; so, to replace that particular block, we need the exact dimensions. If we do not have the exact dimensions, the layered text will not appear at the exact same place.
- Lastly, we include a call to the AddPDFWatermark sub (which we will create now) to add a watermark on each page. The file that is written will be stored onto the C:\.
Make sense now?
Add the next Sub procedure:
Public Shared Sub AddPDFWatermark(ByVal strSource As String, ByVal strDest As String, ByVal imgSource As String)
Dim pdfFileReader As PdfReader = Nothing
Dim psStamp As PdfStamper = Nothing
Dim imgWaterMark As Image = Nothing
Dim pcbContent As PdfContentByte = Nothing
Dim rctRect As Rectangle = Nothing
Dim sngX, sngY As Single
Dim intPageCount As Integer = 0
Try
pdfFileReader = New PdfReader(strSource)
rctRect = pdfFileReader.GetPageSizeWithRotation(1)
psStamp = New PdfStamper(pdfFileReader, New System.IO.FileStream(strDest, IO.FileMode.Create))
imgWaterMark = Image.GetInstance(imgSource)
If imgWaterMark.Width > rctRect.Width OrElse imgWaterMark.Height > rctRect.Height Then
imgWaterMark.ScaleToFit(rctRect.Width, rctRect.Height)
sngX = (rctRect.Width - imgWaterMark.ScaledWidth) / 2
sngY = (rctRect.Height - imgWaterMark.ScaledHeight) / 2
Else
sngX = (rctRect.Width - imgWaterMark.Width) / 2
sngY = (rctRect.Height - imgWaterMark.Height) / 2
End If
imgWaterMark.SetAbsolutePosition(sngX, sngY)
intPageCount = pdfFileReader.NumberOfPages()
For i As Integer = 1 To intPageCount
pcbContent = psStamp.GetUnderContent(i)
pcbContent.AddImage(imgWaterMark)
Next
psStamp.Close()
pdfFileReader.Close()
Catch ex As Exception
Throw ex
End Try
End Sub
This sub adds a watermark to each PDF page. You will notice that here, we almost do the same as we did in the previous sub. The only difference here is that we added an image to the undercontent of each page, instead of replacing textlayers.
The last piece of code we need to add for this form is the call to the ReplacePDFText sub from our start button:
Private Sub Start_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Start.Click
ReplacePDFText("just a simple test", _
StringComparison.CurrentCultureIgnoreCase, _
Application.StartupPath & "\test.pdf", _
"C:\test_words_replaced.pdf")
End Sub
This calls the sub to replace PDF content, and writes the new PDF file to a location on C:\. Now, we will have two files. Obviously, this is just and example and it would be easy to combine all of the changes into one file.
A full explanation can be found here.
This file forms part of the iTextSharp download I mentioned earlier. We need to add this file as is, to our project. Remember, we didn’t create this file or logic, neither have I. But without this file we will not be able to identify the content strings we are looking for. This demonstrates the real power of iTextSharp, and this is why iTextSharp is my preferred choice when it comes to doing any PDF manipulation.
Add a new class and add the following to it (in case you didn’t download the iTextSharp files at the location I’ve mentioned):
Imports System
Imports System.Collections.Generic
Imports System.Text
Imports iTextSharp.text.pdf
Imports iTextSharp.text.pdf.parser
Namespace LocTextExtraction
Public Class LocTextExtractionStrategy
Implements ITextExtractionStrategy
Private _UndercontentCharacterSpacing = 0
Private _UndercontentHorizontalScaling = 0
Private ThisPdfDocFonts As SortedList(Of String, DocumentFont)
Public Shared DUMP_STATE As Boolean = False
Private locationalResult As New List(Of TextChunk)()
Public Sub New()
ThisPdfDocFonts = New SortedList(Of String, DocumentFont)
End Sub
Public Overridable Sub BeginTextBlock() Implements ITextExtractionStrategy.BeginTextBlock
End Sub
Public Overridable Sub EndTextBlock() Implements ITextExtractionStrategy.EndTextBlock
End Sub
Private Function StartsWithSpace(ByVal str As [String]) As Boolean
If str.Length = 0 Then
Return False
End If
Return str(0) = " "c
End Function
Private Function EndsWithSpace(ByVal str As [String]) As Boolean
If str.Length = 0 Then
Return False
End If
Return str(str.Length - 1) = " "c
End Function
Public Property UndercontentCharacterSpacing
Get
Return _UndercontentCharacterSpacing
End Get
Set(ByVal value)
_UndercontentCharacterSpacing = value
End Set
End Property
Public Property UndercontentHorizontalScaling
Get
Return _UndercontentHorizontalScaling
End Get
Set(ByVal value)
_UndercontentHorizontalScaling = value
End Set
End Property
Public Overridable Function GetResultantText() As [String] Implements ITextExtractionStrategy.GetResultantText
If DUMP_STATE Then
DumpState()
End If
locationalResult.Sort()
Dim sb As New StringBuilder()
Dim lastChunk As TextChunk = Nothing
For Each chunk As TextChunk In locationalResult
If lastChunk Is Nothing Then
sb.Append(chunk.text)
Else
If chunk.SameLine(lastChunk) Then
Dim dist As Single = chunk.DistanceFromEndOf(lastChunk)
If dist < -chunk.charSpaceWidth Then
sb.Append(" "c)
ElseIf dist > chunk.charSpaceWidth / 2.0F AndAlso Not StartsWithSpace(chunk.text) AndAlso Not EndsWithSpace(lastChunk.text) Then
sb.Append(" "c)
End If
sb.Append(chunk.text)
Else
sb.Append(ControlChars.Lf)
sb.Append(chunk.text)
End If
End If
lastChunk = chunk
Next
Return sb.ToString()
End Function
Public Function GetTextLocations(ByVal pSearchString As String, ByVal pStrComp As System.StringComparison) As List(Of iTextSharp.text.Rectangle)
Dim FoundMatches As New List(Of iTextSharp.text.Rectangle)
Dim sb As New StringBuilder()
Dim ThisLineChunks As List(Of TextChunk) = New List(Of TextChunk)
Dim bStart As Boolean, bEnd As Boolean
Dim FirstChunk As TextChunk = Nothing, LastChunk As TextChunk = Nothing
Dim sTextInUsedChunks As String = vbNullString
For Each chunk As TextChunk In locationalResult
If ThisLineChunks.Count > 0 AndAlso Not chunk.SameLine(ThisLineChunks.Last) Then
If sb.ToString.IndexOf(pSearchString, pStrComp) > -1 Then
Dim sLine As String = sb.ToString
Dim iCount As Integer = 0
Dim lPos As Integer
lPos = sLine.IndexOf(pSearchString, 0, pStrComp)
Do While lPos > -1
iCount += 1
If lPos + pSearchString.Length > sLine.Length Then Exit Do Else lPos = lPos + pSearchString.Length
lPos = sLine.IndexOf(pSearchString, lPos, pStrComp)
Loop
Dim curPos As Integer = 0
For i As Integer = 1 To iCount
Dim sCurrentText As String, iFromChar As Integer, iToChar As Integer
iFromChar = sLine.IndexOf(pSearchString, curPos, pStrComp)
curPos = iFromChar
iToChar = iFromChar + pSearchString.Length - 1
sCurrentText = vbNullString
sTextInUsedChunks = vbNullString
FirstChunk = Nothing
LastChunk = Nothing
For Each chk As TextChunk In ThisLineChunks
sCurrentText = sCurrentText & chk.text
If Not bStart AndAlso sCurrentText.Length - 1 >= iFromChar Then
FirstChunk = chk
bStart = True
End If
If bStart And Not bEnd Then
sTextInUsedChunks = sTextInUsedChunks & chk.text
End If
If Not bEnd AndAlso sCurrentText.Length - 1 >= iToChar Then
LastChunk = chk
bEnd = True
End If
If bStart And bEnd Then
FoundMatches.Add(GetRectangleFromText(FirstChunk, LastChunk, pSearchString, sTextInUsedChunks, iFromChar, iToChar, pStrComp))
curPos = curPos + pSearchString.Length
bStart = False : bEnd = False
Exit For
End If
Next
Next
End If
sb.Clear()
ThisLineChunks.Clear()
End If
ThisLineChunks.Add(chunk)
sb.Append(chunk.text)
Next
Return FoundMatches
End Function
Private Function GetRectangleFromText(ByVal FirstChunk As TextChunk, ByVal LastChunk As TextChunk, ByVal pSearchString As String, _
ByVal sTextinChunks As String, ByVal iFromChar As Integer, ByVal iToChar As Integer, ByVal pStrComp As System.StringComparison) As iTextSharp.text.Rectangle
Dim LineRealWidth As Single = LastChunk.PosRight - FirstChunk.PosLeft
Dim LineTextWidth As Single = GetStringWidth(sTextinChunks, LastChunk.curFontSize, _
LastChunk.charSpaceWidth, _
ThisPdfDocFonts.Values.ElementAt(LastChunk.FontIndex))
Dim TransformationValue As Single = LineRealWidth / LineTextWidth
Dim iStart As Integer = sTextinChunks.IndexOf(pSearchString, pStrComp)
Dim iEnd As Integer = iStart + pSearchString.Length - 1
Dim sLeft As String
If iStart = 0 Then sLeft = vbNullString Else sLeft = sTextinChunks.Substring(0, iStart)
Dim sRight As String
If iEnd = sTextinChunks.Length - 1 Then sRight = vbNullString Else sRight = sTextinChunks.Substring(iEnd + 1, sTextinChunks.Length - iEnd - 1)
Dim LeftWidth As Single = 0
If iStart > 0 Then
LeftWidth = GetStringWidth(sLeft, LastChunk.curFontSize, _
LastChunk.charSpaceWidth, _
ThisPdfDocFonts.Values.ElementAt(LastChunk.FontIndex))
LeftWidth = LeftWidth * TransformationValue
End If
Dim RightWidth As Single = 0
If iEnd < sTextinChunks.Length - 1 Then
RightWidth = GetStringWidth(sRight, LastChunk.curFontSize, _
LastChunk.charSpaceWidth, _
ThisPdfDocFonts.Values.ElementAt(LastChunk.FontIndex))
RightWidth = RightWidth * TransformationValue
End If
Dim LeftOffset As Single = FirstChunk.distParallelStart + LeftWidth
Dim RightOffset As Single = LastChunk.distParallelEnd - RightWidth
Return New iTextSharp.text.Rectangle(LeftOffset, FirstChunk.PosBottom, RightOffset, FirstChunk.PosTop)
End Function
Private Function GetStringWidth(ByVal str As String, ByVal curFontSize As Single, ByVal pSingleSpaceWidth As Single, ByVal pFont As DocumentFont) As Single
Dim chars() As Char = str.ToCharArray()
Dim totalWidth As Single = 0
Dim w As Single = 0
For Each c As Char In chars
w = pFont.GetWidth(c) / 1000
totalWidth += (w * curFontSize + Me.UndercontentCharacterSpacing) * Me.UndercontentHorizontalScaling / 100
Next
Return totalWidth
End Function
Private Sub DumpState()
For Each location As TextChunk In locationalResult
location.PrintDiagnostics()
Console.WriteLine()
Next
End Sub
Public Overridable Sub RenderText(ByVal renderInfo As TextRenderInfo) Implements ITextExtractionStrategy.RenderText
Dim segment As LineSegment = renderInfo.GetBaseline()
Dim location As New TextChunk(renderInfo.GetText(), segment.GetStartPoint(), segment.GetEndPoint(), renderInfo.GetSingleSpaceWidth())
With location
Debug.Print(renderInfo.GetText)
.PosLeft = renderInfo.GetDescentLine.GetStartPoint(Vector.I1)
.PosRight = renderInfo.GetAscentLine.GetEndPoint(Vector.I1)
.PosBottom = renderInfo.GetDescentLine.GetStartPoint(Vector.I2)
.PosTop = renderInfo.GetAscentLine.GetEndPoint(Vector.I2)
.curFontSize = .PosTop - segment.GetStartPoint()(Vector.I2)
Dim StrKey As String = renderInfo.GetFont.PostscriptFontName & .curFontSize.ToString
If Not ThisPdfDocFonts.ContainsKey(StrKey) Then ThisPdfDocFonts.Add(StrKey, renderInfo.GetFont)
.FontIndex = ThisPdfDocFonts.IndexOfKey(StrKey)
End With
locationalResult.Add(location)
End Sub
Public Class TextChunk
Implements IComparable(Of TextChunk)
Friend text As [String]
Friend startLocation As Vector
Friend endLocation As Vector
Friend orientationVector As Vector
Friend orientationMagnitude As Integer
Friend distPerpendicular As Integer
Friend distParallelStart As Single
Friend distParallelEnd As Single
Friend charSpaceWidth As Single
Private _PosLeft As Single
Private _PosRight As Single
Private _PosTop As Single
Private _PosBottom As Single
Private _curFontSize As Single
Private _FontIndex As Integer
Public Property FontIndex As Integer
Get
Return _FontIndex
End Get
Set(ByVal value As Integer)
_FontIndex = value
End Set
End Property
Public Property PosLeft As Single
Get
Return _PosLeft
End Get
Set(ByVal value As Single)
_PosLeft = value
End Set
End Property
Public Property PosRight As Single
Get
Return _PosRight
End Get
Set(ByVal value As Single)
_PosRight = value
End Set
End Property
Public Property PosTop As Single
Get
Return _PosTop
End Get
Set(ByVal value As Single)
_PosTop = value
End Set
End Property
Public Property PosBottom As Single
Get
Return _PosBottom
End Get
Set(ByVal value As Single)
_PosBottom = value
End Set
End Property
Public Property curFontSize As Single
Get
Return _curFontSize
End Get
Set(ByVal value As Single)
_curFontSize = value
End Set
End Property
Public Sub New(ByVal str As [String], ByVal startLocation As Vector, ByVal endLocation As Vector, ByVal charSpaceWidth As Single)
Me.text = str
Me.startLocation = startLocation
Me.endLocation = endLocation
Me.charSpaceWidth = charSpaceWidth
Dim oVector As Vector = endLocation.Subtract(startLocation)
If oVector.Length = 0 Then
oVector = New Vector(1, 0, 0)
End If
orientationVector = oVector.Normalize()
orientationMagnitude = CInt(Math.Truncate(Math.Atan2(orientationVector(Vector.I2), orientationVector(Vector.I1)) * 1000))
Dim origin As New Vector(0, 0, 1)
distPerpendicular = CInt((startLocation.Subtract(origin)).Cross(orientationVector)(Vector.I3))
distParallelStart = orientationVector.Dot(startLocation)
distParallelEnd = orientationVector.Dot(endLocation)
End Sub
Public Sub PrintDiagnostics()
Console.WriteLine("Text (@" & Convert.ToString(startLocation) & " -> " & Convert.ToString(endLocation) & "): " & text)
Console.WriteLine("orientationMagnitude: " & orientationMagnitude)
Console.WriteLine("distPerpendicular: " & distPerpendicular)
Console.WriteLine("distParallel: " & distParallelStart)
End Sub
Public Function SameLine(ByVal a As TextChunk) As Boolean
If orientationMagnitude <> a.orientationMagnitude Then
Return False
End If
If distPerpendicular <> a.distPerpendicular Then
Return False
End If
Return True
End Function
Public Function DistanceFromEndOf(ByVal other As TextChunk) As Single
Dim distance As Single = distParallelStart - other.distParallelEnd
Return distance
End Function
Public Function CompareTo(ByVal rhs As TextChunk) As Integer Implements System.IComparable(Of TextChunk).CompareTo
If Me Is rhs Then
Return 0
End If
Dim rslt As Integer
rslt = CompareInts(orientationMagnitude, rhs.orientationMagnitude)
If rslt <> 0 Then
Return rslt
End If
rslt = CompareInts(distPerpendicular, rhs.distPerpendicular)
If rslt <> 0 Then
Return rslt
End If
rslt = If(distParallelStart < rhs.distParallelStart, -1, 1)
Return rslt
End Function
Private Shared Function CompareInts(ByVal int1 As Integer, ByVal int2 As Integer) As Integer
Return If(int1 = int2, 0, If(int1 < int2, -1, 1))
End Function
End Class
Public Sub RenderImage(ByVal renderInfo As ImageRenderInfo) Implements IRenderListener.RenderImage
End Sub
End Class
End Namespace
All we need to do now is to import this namespace into our form. Add the following Imports statement to your form’s code:
Imports PDF_Play.LocTextExtraction
If we run our project now, it will work as intended.
I am including my project below for you to download. Sadly, the iTextSharp.dll is quite big, and unfortunately too big to include here; so you need to download it through the steps I have outlined for you.
Conclusion
Thank you for reading my article. Obviously, I am only human (don’t be so surprised!), and I can only do so much; but I couldn’t have written this article if it wasn’t for some help I received from a gentleman called jcis. Thank you – sometimes I bite off more than I can chew…
I hope you have enjoyed this article, and actually learned a thing or two from it. Now I’m off to see what new projects I can do and why VB.NET always seem to be second choice and C# first choice for real hardcore complicated projects…