Text processing | CodeGuru

Text processing

Bruce Eckel’s Thinking in Java

Written By
CodeGuru Staff
Mar 1, 2001

If you come from a C or C++ background, you might be skeptical at first of Java’s power when it comes to handling text. Indeed, one drawback is that execution speed is slower, and that could hinder some of your efforts. However, the tools (in particular the String class) are quite powerful, as the examples in this section show (and performance improvements have been promised for Java).

As you’ll see, these examples were created to solve problems that arose in the creation of this book. However, they are not restricted to that, and the solutions they offer can easily be adapted to other situations. In addition, they show the power of Java in an area that has not previously been emphasized in this book.


Extracting code listings

You’ve no doubt noticed that each complete code listing (not code fragment) in this book begins and ends with the special comment tag marks ‘//:’ and ‘///:~’. This meta-information is included so that the code can be automatically extracted from the book into compilable source-code files. In my previous book, I had a system that allowed me to automatically incorporate tested code files into the book. In this book, however, I discovered that it was often easier to paste the code into the book once it was initially tested and, since it’s hard to get right the first time, to perform edits to the code within the book. But how do you extract the code and test it? This program is the answer, and it could come in handy when you set out to solve a text processing problem. It also demonstrates many of the String class features.

I first save the entire book in ASCII text format into a separate file. The CodePackager program has two modes (which you can see described in usageString): if you use the -p flag, it expects to see an input file containing the ASCII text from the book. It will go through this file and use the comment tag marks to extract the code, and it uses the file name on the first line to determine the name of the file. In addition, it looks for the package statement in case it needs to put the file into a special directory (chosen via the path indicated by the package statement).

But that’s not all. It also watches for the change in chapters by keeping track of the package names. The packages for each chapter begin with c02, c03, c04, etc. to indicate the chapter where they belong (except for those beginning with com, which are ignored for the purpose of keeping track of chapters). As long as the first listing in each chapter contains a package statement with the chapter number, the CodePackager program can keep track of when the chapter changed and put all the subsequent files in the new chapter subdirectory.
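
The chapter-tracking rule just described can be sketched in isolation. This is only an illustration of the idea, not code from the listing: the class name ChapterTracker and its update( ) method are hypothetical, and the starting value c02 is taken from the description above.

```java
// Hypothetical helper: tracks the current chapter directory from package names.
class ChapterTracker {
  // Chapter in effect before any package statement has been seen.
  private String chapter = "c02";

  // Updates the chapter from a package name and returns the chapter now in
  // effect. Packages beginning with "com" leave the chapter unchanged.
  String update(String packageName) {
    if (!packageName.startsWith("com")) {
      int firstDot = packageName.indexOf('.');
      chapter = (firstDot != -1)
          ? packageName.substring(0, firstDot) // "c05.tools" -> "c05"
          : packageName;                       // "c07" -> "c07"
    }
    return chapter;
  }

  public static void main(String[] args) {
    ChapterTracker tracker = new ChapterTracker();
    System.out.println(tracker.update("c05.tools"));            // c05
    System.out.println(tracker.update("com.bruceeckel.tools")); // still c05
    System.out.println(tracker.update("c07"));                  // c07
  }
}
```

Note that the test is a plain startsWith("com"), so any package whose name merely begins with those three letters would also be skipped.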

As each file is extracted, it is placed into a SourceCodeFile object that is then placed into a collection. (This process will be more thoroughly described later.) These SourceCodeFile objects could simply be stored in files, but that brings us to the second use for this project. If you invoke CodePackager without the -p flag, it expects a “packed” file as input, which it will then extract into separate files. So the -p flag means that the extracted files will be found “packed” into this single file.

Why bother with the packed file? Because different computer platforms have different ways of storing text information in files. A big issue is the end-of-line character or characters, but other issues can also exist. However, Java has a special type of IO stream, the DataOutputStream, which promises that, regardless of what machine the data is coming from, the storage of that data will be in a form that can be correctly retrieved by any other machine by using a DataInputStream. That is, Java handles all of the platform-specific details, which is a large part of the promise of Java. So the -p flag stores everything into a single file in a universal format. You download this file and the Java program from the Web, and when you run CodePackager on this file without the -p flag the files will all be extracted to appropriate places on your system. (You can specify an alternate subdirectory; otherwise the subdirectories will just be created in the current directory.) To ensure that no system-specific formats remain, File objects are used everywhere a path or a file is described. In addition, there’s a sanity check: an empty file is placed in each subdirectory; the name of that file indicates how many files you should find in that subdirectory.
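
To make the DataOutputStream promise concrete, here is a minimal round-trip sketch. It is an assumption-laden illustration rather than part of CodePackager: the class PortableRoundTrip and its methods are hypothetical, it works in memory instead of on a file, and it uses newer Java features (try-with-resources, UncheckedIOException) than the listing in this section.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical demo: values written by DataOutputStream have a fixed binary
// layout, so DataInputStream can read them back on any platform.
class PortableRoundTrip {
  // Writes a directory name and a file count in the portable format.
  static byte[] write(String dirname, int fileCount) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      out.writeUTF(dirname);   // length-prefixed modified UTF-8
      out.writeInt(fileCount); // four bytes, high byte first
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  // Reads the two values back in the same order they were written.
  static String read(byte[] data) {
    try (DataInputStream in =
             new DataInputStream(new ByteArrayInputStream(data))) {
      String dirname = in.readUTF();
      int fileCount = in.readInt();
      return dirname + ":" + fileCount;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(read(write("c02", 3))); // c02:3
  }
}
```

The values must be read in the order they were written, since the stream carries no field names.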

Here is the code, which will be described in detail at the end of the listing:

//: CodePackager.java
// "Packs" and "unpacks" the code in "Thinking 
// in Java" for cross-platform distribution.
/* Commented so CodePackager sees it and starts
   a new chapter directory, but so you don't
   have to worry about the directory where this
   program lives:
package c17;
*/
import java.util.*;
import java.io.*;
 
class Pr {
  static void error(String e) {
    System.err.println("ERROR: " + e);
    System.exit(1);
  }
}
 
class IO {
  static BufferedReader disOpen(File f) {
    BufferedReader in = null;
    try {
      in = new BufferedReader(
        new FileReader(f));
    } catch(IOException e) {
      Pr.error("could not open " + f);
    }
    return in;
  }
  static BufferedReader disOpen(String fname) {
    return disOpen(new File(fname));
  }
  static DataOutputStream dosOpen(File f) {
    DataOutputStream in = null;
    try {
      in = new DataOutputStream(
        new BufferedOutputStream(
          new FileOutputStream(f)));
    } catch(IOException e) {
      Pr.error("could not open " + f);
    }
    return in;
  }
  static DataOutputStream dosOpen(String fname) {
    return dosOpen(new File(fname));
  }
  static PrintWriter psOpen(File f) {
    PrintWriter in = null;
    try {
      in = new PrintWriter(
        new BufferedWriter(
          new FileWriter(f)));
    } catch(IOException e) {
      Pr.error("could not open " + f);
    }
    return in;
  }
  static PrintWriter psOpen(String fname) {
    return psOpen(new File(fname));
  }
  static void close(Writer os) {
    try {
      os.close();
    } catch(IOException e) {
      Pr.error("closing " + os);
    }
  }
  static void close(DataOutputStream os) {
    try {
      os.close();
    } catch(IOException e) {
      Pr.error("closing " + os);
    }
  }
  static void close(Reader os) {
    try {
      os.close();
    } catch(IOException e) {
      Pr.error("closing " + os);
    }
  }
}
 
class SourceCodeFile {
  public static final String
    startMarker = "//:", // Start of source file
    endMarker = "} ///:~", // End of source
    endMarker2 = "}; ///:~", // C++ file end
    beginContinue = "} ///:Continued",
    endContinue = "///:Continuing",
    packMarker = "###", // Packed file header tag
    eol = // Line separator on current system
      System.getProperty("line.separator"),
    filesep = // System's file path separator
      System.getProperty("file.separator");
  public static String copyright = "";
  static {
    try {
      BufferedReader cr =
        new BufferedReader(
          new FileReader("Copyright.txt"));
      String crin;
      while((crin = cr.readLine()) != null)
        copyright += crin + "\n";
      cr.close();
    } catch(Exception e) {
      copyright = "";
    }
  }
  private String filename, dirname,
    contents = new String();
  private static String chapter = "c02";
  // The file name separator from the old system:
  public static String oldsep;
  public String toString() {
    return dirname + filesep + filename;
  }
  // Constructor for parsing from document file:
  public SourceCodeFile(String firstLine,
      BufferedReader in) {
    dirname = chapter;
    // Skip past marker:
    filename = firstLine.substring(
        startMarker.length()).trim();
    // Find space that terminates file name:
    if(filename.indexOf(' ') != -1)
      filename = filename.substring(
          0, filename.indexOf(' '));
    System.out.println("found: " + filename);
    contents = firstLine + eol;
    if(copyright.length() != 0)
      contents += copyright + eol;
    String s;
    boolean foundEndMarker = false;
    try {
      while((s = in.readLine()) != null) {
        if(s.startsWith(startMarker))
          Pr.error("No end of file marker for " +
            filename);
        // For this program, no spaces before 
        // the "package" keyword are allowed
        // in the input source code:
        else if(s.startsWith("package")) {
          // Extract package name:
          String pdir = s.substring(
            s.indexOf(' ')).trim();
          pdir = pdir.substring(
            0, pdir.indexOf(';')).trim();
          // Capture the chapter from the package
          // ignoring the 'com' subdirectories:
          if(!pdir.startsWith("com")) {
            int firstDot = pdir.indexOf('.');
            if(firstDot != -1)
              chapter =
                pdir.substring(0,firstDot);
            else
              chapter = pdir;
          }
          // Convert package name to path name:
          pdir = pdir.replace(
            '.', filesep.charAt(0));
          System.out.println("package " + pdir);
          dirname = pdir;
        }
        contents += s + eol;
        // Move past continuations:
        if(s.startsWith(beginContinue))
          while((s = in.readLine()) != null)
            if(s.startsWith(endContinue)) {
              contents += s + eol;
              break;
            }
        // Watch for end of code listing:
        if(s.startsWith(endMarker) ||
           s.startsWith(endMarker2)) {
          foundEndMarker = true;
          break;
        }
      }
      if(!foundEndMarker)
        Pr.error(
          "End marker not found before EOF");
      System.out.println("Chapter: " + chapter);
    } catch(IOException e) {
      Pr.error("Error reading line");
    }
  }
  // For recovering from a packed file:
  public SourceCodeFile(BufferedReader pFile) {
    try {
      String s = pFile.readLine();
      if(s == null) return;
      if(!s.startsWith(packMarker))
        Pr.error("Can't find " + packMarker
          + " in " + s);
      s = s.substring(
        packMarker.length()).trim();
      dirname = s.substring(0, s.indexOf('#'));
      filename = s.substring(s.indexOf('#') + 1);
      dirname = dirname.replace(
        oldsep.charAt(0), filesep.charAt(0));
      filename = filename.replace(
        oldsep.charAt(0), filesep.charAt(0));
      System.out.println("listing: " + dirname
        + filesep + filename);
      while((s = pFile.readLine()) != null) {
        // Watch for end of code listing:
        if(s.startsWith(endMarker) ||
           s.startsWith(endMarker2)) {
          contents += s;
          break;
        }
        contents += s + eol;
      }
    } catch(IOException e) {
      System.err.println("Error reading line");
    }
  }
  public boolean hasFile() {
    return filename != null;
  }
  public String directory() { return dirname; }
  public String filename() { return filename; }
  public String contents() { return contents; }
  // To write to a packed file:
  public void writePacked(DataOutputStream out) {
    try {
      out.writeBytes(
        packMarker + dirname + "#"
        + filename + eol);
      out.writeBytes(contents);
    } catch(IOException e) {
      Pr.error("writing " + dirname +
        filesep + filename);
    }
  }
  // To generate the actual file:
  public void writeFile(String rootpath) {
    File path = new File(rootpath, dirname);
    path.mkdirs();
    PrintWriter p =
      IO.psOpen(new File(path, filename));
    p.print(contents);
    IO.close(p);
  }
}
 
class DirMap {
  private Hashtable t = new Hashtable();
  private String rootpath;
  DirMap() {
    rootpath = System.getProperty("user.dir");
  }
  DirMap(String alternateDir) {
    rootpath = alternateDir;
  }
  public void add(SourceCodeFile f){
    String path = f.directory();
    if(!t.containsKey(path))
      t.put(path, new Vector());
    ((Vector)t.get(path)).addElement(f);
  }
  public void writePackedFile(String fname) {
    DataOutputStream packed = IO.dosOpen(fname);
    try {
      packed.writeBytes("###Old Separator:" +
        SourceCodeFile.filesep + "###\n");
    } catch(IOException e) {
      Pr.error("Writing separator to " + fname);
    }
    Enumeration e = t.keys();
    while(e.hasMoreElements()) {
      String dir = (String)e.nextElement();
      System.out.println(
        "Writing directory " + dir);
      Vector v = (Vector)t.get(dir);
      for(int i = 0; i < v.size(); i++) {
        SourceCodeFile f =
          (SourceCodeFile)v.elementAt(i);
        f.writePacked(packed);
      }
    }
    IO.close(packed);
  }
  // Write all the files in their directories:
  public void write() {
    Enumeration e = t.keys();
    while(e.hasMoreElements()) {
      String dir = (String)e.nextElement();
      Vector v = (Vector)t.get(dir);
      for(int i = 0; i < v.size(); i++) {
        SourceCodeFile f =
          (SourceCodeFile)v.elementAt(i);
        f.writeFile(rootpath);
      }
      // Add file indicating file quantity
      // written to this directory as a check:
      IO.close(IO.dosOpen(
        new File(new File(rootpath, dir),
          Integer.toString(v.size())+".files")));
    }
  }
}
 
public class CodePackager {
  private static final String usageString =
  "usage: java CodePackager packedFileName" +
  "\nExtracts source code files from packed \n" +
  "version of Tjava.doc sources into " +
  "directories off current directory\n" +
  "java CodePackager packedFileName newDir\n" +
  "Extracts into directories off newDir\n" +
  "java CodePackager -p source.txt packedFile" +
  "\nCreates packed version of source files" +
  "\nfrom text version of Tjava.doc";
  private static void usage() {
    System.err.println(usageString);
    System.exit(1);
  }
  public static void main(String[] args) {
    if(args.length == 0) usage();
    if(args[0].equals("-p")) {
      if(args.length != 3)
        usage();
      createPackedFile(args);
    }
    else {
      if(args.length > 2)
        usage();
      extractPackedFile(args);
    }
  }
  private static String currentLine;
  private static BufferedReader in;
  private static DirMap dm;
  private static void
  createPackedFile(String[] args) {
    dm = new DirMap();
    in = IO.disOpen(args[1]);
    try {
      while((currentLine = in.readLine())
          != null) {
        if(currentLine.startsWith(
            SourceCodeFile.startMarker)) {
          dm.add(new SourceCodeFile(
                   currentLine, in));
        }
        else if(currentLine.startsWith(
            SourceCodeFile.endMarker))
          Pr.error("file has no start marker");
        // Else ignore the input line
      }
    } catch(IOException e) {
      Pr.error("Error reading " + args[1]);
    }
    IO.close(in);
    dm.writePackedFile(args[2]);
  }
  private static void
  extractPackedFile(String[] args) {
    if(args.length == 2) // Alternate directory
      dm = new DirMap(args[1]);
    else // Current directory
      dm = new DirMap();
    in = IO.disOpen(args[0]);
    String s = null;
    try {
       s = in.readLine();
    } catch(IOException e) {
      Pr.error("Cannot read from " + in);
    }
    // Capture the separator used in the system
    // that packed the file:
    if(s.indexOf("###Old Separator:") != -1 ) {
      String oldsep = s.substring(
        "###Old Separator:".length());
      oldsep = oldsep.substring(
        0, oldsep.indexOf('#'));
      SourceCodeFile.oldsep = oldsep;
    }
    SourceCodeFile sf = new SourceCodeFile(in);
    while(sf.hasFile()) {
      dm.add(sf);
      sf = new SourceCodeFile(in);
    }
    dm.write();
  }
} ///:~ 

You’ll first notice the package statement that is commented out. Since this is the first program in the chapter, the package statement is necessary to tell CodePackager that the chapter has changed, but putting it in a package would be a problem. When you create a package, you tie the resulting program to a particular directory structure, which is fine for most of the examples in this book. Here, however, the CodePackager program must be compiled and run from an arbitrary directory, so the package statement is commented out. It will still look like an ordinary package statement to CodePackager, though, since the program isn’t sophisticated enough to detect multi-line comments. (It has no need for such sophistication, a fact that comes in handy here.)

The first two classes are support/utility classes designed to make the rest of the program more consistent to write and easier to read. The first, Pr, is similar to the ANSI C library perror, since it prints an error message (but also exits the program). The second class, IO, encapsulates the creation of files, a process that was shown in Chapter 10 as one that rapidly becomes verbose and annoying. In Chapter 10, the proposed solution created new classes, but here static method calls are used. Within those methods the appropriate exceptions are caught and dealt with. These methods make the rest of the code much cleaner to read.

The first class that helps solve the problem is SourceCodeFile, which represents all the information (including the contents, file name, and directory) for one source code file in the book. It also contains a set of String constants representing the markers that start and end a file, a marker used inside the packed file, the current system’s end-of-line separator and file path separator (notice the use of System.getProperty( ) to get the local version), and a copyright notice, which is extracted from the following file, Copyright.txt:

//////////////////////////////////////////////////
// Copyright (c) Bruce Eckel, 1998
// Source code file from the book "Thinking in Java"
// All rights reserved EXCEPT as allowed by the
// following statements: You may freely use this file
// for your own work (personal or commercial),
// including modifications and distribution in
// executable form only. Permission is granted to use
// this file in classroom situations, including its
// use in presentation materials, as long as the book
// "Thinking in Java" is cited as the source. 
// Except in classroom situations, you may not copy
// and distribute this code; instead, the sole
// distribution point is http://www.BruceEckel.com 
// (and official mirror sites) where it is
// freely available. You may not remove this
// copyright and notice. You may not distribute
// modified versions of the source code in this
// package. You may not use this file in printed
// media without the express permission of the
// author. Bruce Eckel makes no representation about
// the suitability of this software for any purpose.
// It is provided "as is" without express or implied
// warranty of any kind, including any implied
// warranty of merchantability, fitness for a
// particular purpose or non-infringement. The entire
// risk as to the quality and performance of the
// software is with you. Bruce Eckel and the
// publisher shall not be liable for any damages
// suffered by you or any third party as a result of
// using or distributing software. In no event will
// Bruce Eckel or the publisher be liable for any
// lost revenue, profit, or data, or for direct,
// indirect, special, consequential, incidental, or
// punitive damages, however caused and regardless of
// the theory of liability, arising out of the use of
// or inability to use software, even if Bruce Eckel
// and the publisher have been advised of the
// possibility of such damages. Should the software
// prove defective, you assume the cost of all
// necessary servicing, repair, or correction. If you
// think you've found an error, please email all
// modified files with clearly commented changes to:
// Bruce@EckelObjects.com. (please use the same
// address for non-code errors found in the book).
//////////////////////////////////////////////////

When extracting files from a packed file, the file separator of the system that packed the file is also noted, so it can be replaced with the correct one for the local system.

The subdirectory name for the current chapter is kept in the field chapter, which is initialized to c02. (You’ll notice that the listing in Chapter 2 doesn’t contain a package statement.) The only time that the chapter field changes is when a package statement is discovered in the current file.


Building a packed file

The first constructor is used to extract a file from the ASCII text version of this book. The calling code (which appears further down in the listing) reads each line in until it finds one that matches the beginning of a listing. At that point, it creates a new SourceCodeFile object, passing it the first line (which has already been read by the calling code) and the BufferedReader object from which to extract the rest of the source code listing.

At this point, you begin to see heavy use of the String methods. To extract the file name, the overloaded version of substring( ) is called that takes the starting offset and goes to the end of the String. This starting index is produced by finding the length( ) of the startMarker. trim( ) removes white space from both ends of the String. The first line can also have words after the name of the file; these are detected using indexOf( ), which returns -1 if it cannot find the character you’re looking for and the index of the first instance of that character if it does. Notice there is also an overloaded version of indexOf( ) that takes a String instead of a character.
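
The file-name extraction described above can be tried on its own. The following is a small sketch that follows the same substring( ), trim( ) and indexOf( ) steps; the class ListingName and the method fileNameFrom( ) are hypothetical names introduced only for this illustration.

```java
// Hypothetical helper: pulls the file name out of a listing's first line.
class ListingName {
  static final String START_MARKER = "//:";

  static String fileNameFrom(String firstLine) {
    // Skip past the marker, then drop white space at both ends.
    String name = firstLine.substring(START_MARKER.length()).trim();
    // Anything after the first space is commentary, not part of the name.
    int space = name.indexOf(' ');
    return (space != -1) ? name.substring(0, space) : name;
  }

  public static void main(String[] args) {
    System.out.println(fileNameFrom("//: CodePackager.java"));
    // CodePackager.java
    System.out.println(fileNameFrom("//: Hello.java A first program"));
    // Hello.java
  }
}
```

The sketch assumes the line really does begin with the start marker, which is what the calling code checks before constructing a SourceCodeFile.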

Once the file name is parsed and stored, the first line is placed into the contents String (which is used to hold the entire text of the source code listing). At this point, the rest of the lines are read and concatenated into the contents String. It’s not quite that simple, since certain situations require special handling. One case is error checking: if you run into a startMarker, it means that no end marker was placed at the end of the listing that’s currently being collected. This is an error condition that aborts the program.

The second special case is the package keyword. Although Java is a free-form language, this program requires that the package keyword be at the beginning of the line. When the package keyword is seen, the package name is extracted by looking for the space at the beginning and the semicolon at the end. (Note that this could also have been performed in a single operation by using the overloaded substring( ) that takes both the starting and ending indexes.) Then the dots in the package name are replaced by the file separator, although an assumption is made here that the file separator is only one character long. This is probably true on all systems, but it’s a place to look if there are problems.
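
As a sketch of the single-operation alternative mentioned in the parenthetical, the two-index substring( ) can extract the package name in one call. The class PackageToPath and its methods are hypothetical, and the code assumes the line contains both a space and a semicolon, as a well-formed package statement does.

```java
import java.io.File;

// Hypothetical helper: turns a package statement into a directory name.
class PackageToPath {
  // Extracts the name with one substring() call that takes both indexes.
  static String packageName(String line) {
    return line.substring(line.indexOf(' '), line.indexOf(';')).trim();
  }

  // Replaces each dot with a single-character path separator.
  static String toDirectory(String packageName, char separator) {
    return packageName.replace('.', separator);
  }

  public static void main(String[] args) {
    String name = packageName("package c05.tools;");
    System.out.println(name); // c05.tools
    // File.separatorChar is the local system's separator as a char.
    System.out.println(toDirectory(name, File.separatorChar));
  }
}
```

Using File.separatorChar sidesteps the one-character assumption, because it is a char by definition.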

The default behavior is to concatenate each line to contents, along with the end-of-line string, until the endMarker is discovered, which indicates that the constructor should terminate. If the end of the file is encountered before the endMarker is seen, that’s an error.


Extracting from a packed file

The second constructor is used to recover the source code files from a packed file. Here, the calling method doesn’t have to worry about skipping over the intermediate text. The file contains all the source-code files, placed end-to-end. All you need to hand to this constructor is the BufferedReader where the information is coming from, and the constructor takes it from there. There is some meta-information, however, at the beginning of each listing, and this is denoted by the packMarker. If the packMarker isn’t there, it means the caller is mistakenly trying to use this constructor where it isn’t appropriate.

Once the packMarker is found, it is stripped off and the directory name (terminated by a ‘#’) and the file name (which goes to the end of the line) are extracted. In both cases, the old separator character is replaced by the one that is current to this machine using the String replace( ) method. The old separator is placed at the beginning of the packed file, and you’ll see how that is extracted later in the listing.
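
The header handling just described can be sketched as a standalone function. The class PackedHeader and its parse( ) method are hypothetical, and the separators are passed in as parameters here (an assumption made so the example runs the same way on any system) rather than read from static fields as in the listing.

```java
// Hypothetical helper: splits a packed-file header line such as
// "###c05\tools#Tool.java" into a directory name and a file name.
class PackedHeader {
  static final String PACK_MARKER = "###";

  // Returns a two-element array: { directory, fileName }.
  static String[] parse(String headerLine, char oldSep, char newSep) {
    if (!headerLine.startsWith(PACK_MARKER))
      throw new IllegalArgumentException(
          "Can't find " + PACK_MARKER + " in " + headerLine);
    String s = headerLine.substring(PACK_MARKER.length()).trim();
    int hash = s.indexOf('#'); // terminates the directory name
    String dir = s.substring(0, hash).replace(oldSep, newSep);
    String file = s.substring(hash + 1).replace(oldSep, newSep);
    return new String[] { dir, file };
  }

  public static void main(String[] args) {
    // A header packed on a system that uses '\' and unpacked where '/' is used.
    String[] parts = parse("###c05\\tools#Tool.java", '\\', '/');
    System.out.println(parts[0]); // c05/tools
    System.out.println(parts[1]); // Tool.java
  }
}
```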

The rest of the constructor is quite simple. It reads and concatenates each line to the contents until the endMarker is found.


Accessing and writing the listings

The next set of methods are simple accessors: directory( ), filename( ) (notice the method can have the same spelling and capitalization as the field), and contents( ), along with hasFile( ) to indicate whether this object contains a file or not. (The need for this will be seen later.)

The final two methods are concerned with writing this code listing into a file, either a packed file via writePacked( ) or a Java source file via writeFile( ). All writePacked( ) needs is the DataOutputStream, which was opened elsewhere and represents the file that’s being written. It puts the header information on the first line and then calls writeBytes( ) to write contents in a “universal” format. (writeBytes( ) writes only the low-order byte of each character, which is sufficient for the ASCII text handled here.)

When writing the Java source file, the file must be created. This is done via IO.psOpen( ), handing it a File object that contains not only the file name but also the path. But the question now is: does this path exist? The user has the option of placing all the source code directories into a completely different subdirectory, which might not even exist. So before each file is written, File.mkdirs( ) is called with the path that you want to write the file into. This will make the entire path all at once.
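
A minimal sketch of this create-the-path-then-write step follows. The class WriteWithDirs and the directory names are hypothetical, and the demo writes under the system temporary directory (an assumption chosen so it does not clutter the current directory).

```java
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.UncheckedIOException;

// Hypothetical helper: writes a listing, creating its directories first.
class WriteWithDirs {
  static File write(File root, String dirname, String filename,
                    String contents) {
    File path = new File(root, dirname);
    path.mkdirs(); // creates every missing directory along the path
    if (!path.isDirectory())
      throw new IllegalStateException("could not create " + path);
    File target = new File(path, filename);
    try (PrintWriter out = new PrintWriter(
             new BufferedWriter(new FileWriter(target)))) {
      out.print(contents);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return target;
  }

  public static void main(String[] args) {
    File root = new File(System.getProperty("java.io.tmpdir"),
                         "listings-" + System.nanoTime());
    File f = write(root, "c05" + File.separator + "tools",
                   "Tool.java", "// empty\n");
    System.out.println(f + " exists: " + f.isFile());
  }
}
```

mkdirs( ) returns false when the directories already exist, so the sketch checks isDirectory( ) afterward instead of relying on the return value.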


Containing the entire collection of listings

It’s convenient to organize the listings as subdirectories while the whole collection is being built in memory. One reason is another sanity check: as each subdirectory of listings is created, an additional file is added whose name contains the number of files in that directory.

The DirMap class produces this effect and demonstrates the concept of a “multimap.” This is implemented using a Hashtable whose keys are the subdirectories being created and whose values are Vector objects containing the SourceCodeFile objects in that particular directory. Thus, instead of mapping a key to a single value, the “multimap” maps a key to a set of values via the associated Vector. Although this sounds complex, it’s remarkably straightforward to implement. You’ll see that most of the size of the DirMap class is due to the portions that write to files, not to the “multimap” implementation.

There are two ways you can make a DirMap: the default constructor assumes that you want the directories to branch off of the current one, and the second constructor lets you specify an alternate absolute path for the starting directory.

The add( ) method is where quite a bit of dense action occurs. First, the directory( ) is extracted from the SourceCodeFile you want to add, and then the Hashtable is examined to see if it contains that key already. If not, a new Vector is added to the Hashtable and associated with that key. At this point, the Vector is there, one way or another, and it is extracted so the SourceCodeFile can be added. Because Vectors can be easily combined with Hashtables like this, the power of both is amplified.
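
For comparison, the same “multimap” idea can be sketched with the generic collections of later Java versions. This is not how DirMap is written: the class DirMultimap is hypothetical, it stores plain file names instead of SourceCodeFile objects, and it swaps Hashtable and Vector for TreeMap and ArrayList (TreeMap is chosen here only so the directories come back in a predictable order).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: maps each directory to the list of files it holds.
class DirMultimap {
  private final Map<String, List<String>> byDir = new TreeMap<>();

  // Creates the list for a directory on first use, then appends the file.
  void add(String dir, String file) {
    byDir.computeIfAbsent(dir, k -> new ArrayList<>()).add(file);
  }

  // Returns the files for a directory, or an empty list if none were added.
  List<String> files(String dir) {
    return byDir.getOrDefault(dir, Collections.emptyList());
  }

  // Name of the sanity-check file for a directory, e.g. "2.files".
  String countFileName(String dir) {
    return files(dir).size() + ".files";
  }

  public static void main(String[] args) {
    DirMultimap map = new DirMultimap();
    map.add("c02", "HelloDate.java");
    map.add("c02", "Property.java");
    map.add("c03", "Assignment.java");
    System.out.println(map.files("c02"));
    System.out.println(map.countFileName("c02")); // 2.files
  }
}
```

computeIfAbsent( ) folds the “check for the key, add a new list if missing, then fetch it” sequence of add( ) into one call.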

Writing a packed file involves opening the file to write (as a DataOutputStream so the data is universally recoverable) and writing the header information about the old separator on the first line. Next, an Enumeration of the Hashtable keys is produced and stepped through to select each directory and to fetch the Vector associated with that directory so each SourceCodeFile in that Vector can be written to the packed file.

Writing the Java source files to their directories in write( ) is almost identical to writePackedFile( ) since both methods simply call the appropriate method in SourceCodeFile. Here, however, the root path is passed into SourceCodeFile.writeFile( ), and when all the files in a directory have been written the additional file with the name containing the number of files is also written.


The main program

The previously described classes are used within CodePackager. First you see the usage string that gets printed whenever the end user invokes the program incorrectly, along with the usage( ) method that prints it and exits the program. All main( ) does is determine whether you want to create a packed file or extract from one, then it ensures the arguments are correct and calls the appropriate method.

When a packed file is created, it’s assumed to be made in the current directory, so the DirMap is created using the default constructor. After the file is opened, each line is read and examined for particular conditions:

  1. If the line starts with the starting marker for a source code listing, a new SourceCodeFile object is created. The constructor reads in the rest of the source listing. The handle that results is directly added to the DirMap.
  2. If the line starts with the end marker for a source code listing, something has gone wrong, since end markers should be found only by the SourceCodeFile constructor.
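
The two conditions above can be sketched as a simplified scanning loop. This is an illustration only: the class ListingScanner is hypothetical, it collects just the first-line names rather than building SourceCodeFile objects, it reads from a String instead of a file, and it throws exceptions where the listing calls Pr.error( ).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: finds each listing in a block of text by its markers.
class ListingScanner {
  static final String START = "//:";
  static final String END = "} ///:~";

  static List<String> listingNames(String text) {
    List<String> names = new ArrayList<>();
    try (BufferedReader in = new BufferedReader(new StringReader(text))) {
      String line;
      while ((line = in.readLine()) != null) {
        if (line.startsWith(START)) {
          // Condition 1: a listing begins; consume it up to its end marker.
          names.add(line.substring(START.length()).trim());
          boolean closed = false;
          while ((line = in.readLine()) != null) {
            if (line.startsWith(END)) { closed = true; break; }
          }
          if (!closed)
            throw new IllegalStateException(
                "End marker not found before EOF");
        } else if (line.startsWith(END)) {
          // Condition 2: an end marker outside any listing is an error.
          throw new IllegalStateException("file has no start marker");
        }
        // Any other line is ordinary book text and is ignored.
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return names;
  }

  public static void main(String[] args) {
    String text = "Some prose\n//: A.java\nclass A {\n} ///:~\n"
                + "More prose\n//: B.java\nclass B {\n} ///:~\n";
    System.out.println(listingNames(text)); // [A.java, B.java]
  }
}
```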

When extracting a packed file, the extraction can be into the current directory or into an alternate directory, so the DirMap object is created accordingly. The file is opened and the first line is read. The old file path separator information is extracted from this line. Then the input is used to create the first SourceCodeFile object, which is added to the DirMap. New SourceCodeFile objects are created and added as long as they contain a file. (The last one created will simply return when it runs out of input and then hasFile( ) will return false.)


Checking capitalization style

Although the previous example can come in handy as a guide for some project of your own that involves text processing, this project will be directly useful because it performs a style check to make sure that your capitalization conforms to the de facto Java style. It opens each .java file in the current directory and extracts all the class names and identifiers, then shows you if any of them don’t meet the Java style.

For the program to operate correctly, you must first build a class name repository to hold all the class names in the standard Java library. You do this by moving into all the source code subdirectories for the standard Java library and running ClassScanner in each subdirectory. Provide as arguments the name of the repository file (using the same path and name each time) and the -a command-line option to indicate that the class names should be added to the repository.

To use the program to check your code, run it and hand it the path and name of the repository to use. It will check all the classes and identifiers in the current directory and tell you which ones don’t follow the typical Java capitalization style.

You should be aware that the program isn’t perfect; there are a few times when it will point out what it thinks is a problem, but on looking at the code you’ll see that nothing needs to be changed. This is a little annoying, but it’s still much easier than trying to find all these cases by staring at your code.

The explanation immediately follows the listing:

//: ClassScanner.java
// Scans all files in directory for classes
// and identifiers, to check capitalization.
// Assumes properly compiling code listings.
// Doesn't do everything right, but is a very
// useful aid.
import java.io.*;
import java.util.*;
 
class MultiStringMap extends Hashtable {
  public void add(String key, String value) {
    if(!containsKey(key))
      put(key, new Vector());
    ((Vector)get(key)).addElement(value);
  }
  public Vector getVector(String key) {
    if(!containsKey(key)) {
      System.err.println(
        "ERROR: can't find key: " + key);
      System.exit(1);
    }
    return (Vector)get(key);
  }
  public void printValues(PrintStream p) {
    Enumeration k = keys();
    while(k.hasMoreElements()) {
      String oneKey = (String)k.nextElement();
      Vector val = getVector(oneKey);
      for(int i = 0; i < val.size(); i++)
        p.println((String)val.elementAt(i));
    }
  }
}
 
public class ClassScanner {
  private File path;
  private String[] fileList;
  private Properties classes = new Properties();
  private MultiStringMap
    classMap = new MultiStringMap(),
    identMap = new MultiStringMap();
  private StreamTokenizer in;
  public ClassScanner() {
    path = new File(".");
    fileList = path.list(new JavaFilter());
    for(int i = 0; i < fileList.length; i++) {
      System.out.println(fileList[i]);
      scanListing(fileList[i]);
    }
  }
  void scanListing(String fname) {
    try {
      in = new StreamTokenizer(
          new BufferedReader(
            new FileReader(fname)));
      // Doesn't seem to work:
      // in.slashStarComments(true);
      // in.slashSlashComments(true);
      in.ordinaryChar('/');
      in.ordinaryChar('.');
      in.wordChars('_', '_');
      in.eolIsSignificant(true);
      while(in.nextToken() !=
            StreamTokenizer.TT_EOF) {
        if(in.ttype == '/')
          eatComments();
        else if(in.ttype ==
                StreamTokenizer.TT_WORD) {
          if(in.sval.equals("class") ||
             in.sval.equals("interface")) {
            // Get class name:
               while(in.nextToken() !=
                     StreamTokenizer.TT_EOF
                     && in.ttype !=
                     StreamTokenizer.TT_WORD)
                 ;
               classes.put(in.sval, in.sval);
               classMap.add(fname, in.sval);
          }
          if(in.sval.equals("import") ||
             in.sval.equals("package"))
            discardLine();
          else // It's an identifier or keyword
            identMap.add(fname, in.sval);
        }
      }
    } catch(IOException e) {
      e.printStackTrace();
    }
  }
  void discardLine() {
    try {
      while(in.nextToken() !=
            StreamTokenizer.TT_EOF
            && in.ttype !=
            StreamTokenizer.TT_EOL)
        ; // Throw away tokens to end of line
    } catch(IOException e) {
      e.printStackTrace();
    }
  }
  // StreamTokenizer's comment removal seemed
  // to be broken. This extracts them:
  void eatComments() {
    try {
      if(in.nextToken() !=
         StreamTokenizer.TT_EOF) {
        if(in.ttype == '/')
          discardLine();
        else if(in.ttype != '*')
          in.pushBack();
        else
          while(true) {
            if(in.nextToken() ==
              StreamTokenizer.TT_EOF)
              break;
            if(in.ttype == '*')
              if(in.nextToken() !=
                StreamTokenizer.TT_EOF
                && in.ttype == '/')
                break;
          }
      }
    } catch(IOException e) {
      e.printStackTrace();
    }
  }
  public String[] classNames() {
    String[] result = new String[classes.size()];
    Enumeration e = classes.keys();
    int i = 0;
    while(e.hasMoreElements())
      result[i++] = (String)e.nextElement();
    return result;
  }
  public void checkClassNames() {
    Enumeration files = classMap.keys();
    while(files.hasMoreElements()) {
      String file = (String)files.nextElement();
      Vector cls = classMap.getVector(file);
      for(int i = 0; i < cls.size(); i++) {
        String className =
          (String)cls.elementAt(i);
        if(Character.isLowerCase(
             className.charAt(0)))
          System.out.println(
            "class capitalization error, file: "
            + file + ", class: "
            + className);
      }
    }
  }
  public void checkIdentNames() {
    Enumeration files = identMap.keys();
    Vector reportSet = new Vector();
    while(files.hasMoreElements()) {
      String file = (String)files.nextElement();
      Vector ids = identMap.getVector(file);
      for(int i = 0; i < ids.size(); i++) {
        String id =
          (String)ids.elementAt(i);
        if(!classes.contains(id)) {
          // Ignore identifiers of length 3 or
          // longer that are all uppercase
          // (probably static final values):
          if(id.length() >= 3 &&
             id.equals(
               id.toUpperCase()))
            continue;
          // Check to see if first char is upper:
          if(Character.isUpperCase(id.charAt(0))){
            if(reportSet.indexOf(file + id)
                == -1){ // Not reported yet
              reportSet.addElement(file + id);
              System.out.println(
                "Ident capitalization error in:"
                + file + ", ident: " + id);
            }
          }
        }
      }
    }
  }
  static final String usage =
    "Usage: \n" +
    "ClassScanner classnames -a\n" +
    "\tAdds all the class names in this \n" +
    "\tdirectory to the repository file \n" +
    "\tcalled 'classnames'\n" +
    "ClassScanner classnames\n" +
    "\tChecks all the java files in this \n" +
    "\tdirectory for capitalization errors, \n" +
    "\tusing the repository file 'classnames'";
  private static void usage() {
    System.err.println(usage);
    System.exit(1);
  }
  public static void main(String[] args) {
    if(args.length < 1 || args.length > 2)
      usage();
    ClassScanner c = new ClassScanner();
    File old = new File(args[0]);
    if(old.exists()) {
      try {
        // Try to open an existing 
        // properties file:
        InputStream oldlist =
          new BufferedInputStream(
            new FileInputStream(old));
        c.classes.load(oldlist);
        oldlist.close();
      } catch(IOException e) {
        System.err.println("Could not open "
          + old + " for reading");
        System.exit(1);
      }
    }
    if(args.length == 1) {
      c.checkClassNames();
      c.checkIdentNames();
    }
    // Write the class names to a repository:
    if(args.length == 2) {
      if(!args[1].equals("-a"))
        usage();
      try {
        BufferedOutputStream out =
          new BufferedOutputStream(
            new FileOutputStream(args[0]));
        c.classes.save(out,
          "Classes found by ClassScanner.java");
        out.close();
      } catch(IOException e) {
        System.err.println(
          "Could not write " + args[0]);
        System.exit(1);
      }
    }
  }
}
 
class JavaFilter implements FilenameFilter {
  public boolean accept(File dir, String name) {
    // Strip path information:
    String f = new File(name).getName();
    return f.trim().endsWith(".java");
  }
} ///:~ 

The class MultiStringMap is a tool that allows you to map a group of strings onto each key entry. As in the previous example, it uses a Hashtable (this time with inheritance) with the key as the single string that’s mapped onto the Vector value. The add( ) method checks to see if the key is already in the Hashtable, and if not it puts the key there with a new, empty Vector; either way, the value is then appended to that key’s Vector. The getVector( ) method produces a Vector for a particular key (and exits with an error message if the key is missing), and printValues( ), which is primarily useful for debugging, prints out all the values Vector by Vector.
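
The same idea can be sketched with the generic collections that arrived in later versions of Java. This is only an illustration, not the code used in the listing: the class name MultiStringMapSketch, the method getList( ), and the sample file names are hypothetical, and unlike getVector( ) the sketch returns an empty list for a missing key instead of exiting.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: each key maps onto a growing list of strings.
class MultiStringMapSketch {
  private final Map<String, List<String>> map = new LinkedHashMap<>();

  // Create the list the first time a key is seen, then append the value.
  void add(String key, String value) {
    map.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
  }

  // Assumption: a missing key yields an empty list rather than
  // terminating the program as getVector( ) does in the listing.
  List<String> getList(String key) {
    return map.getOrDefault(key, new ArrayList<>());
  }

  public static void main(String[] args) {
    MultiStringMapSketch m = new MultiStringMapSketch();
    m.add("A.java", "Alpha");
    m.add("A.java", "Helper");
    m.add("B.java", "Beta");
    System.out.println(m.getList("A.java"));
    System.out.println(m.getList("B.java"));
  }
}
```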

To keep life simple, the class names from the standard Java libraries are all put into a Properties object (from the standard Java library). Remember that a Properties object is a Hashtable that holds only String objects for both the key and value entries. However, it can be saved to disk and restored from disk in one method call, so it’s ideal for the repository of names. Actually, we need only a list of names, and a Hashtable can’t accept null for either its key or its value entry. So the same object will be used for both the key and the value.
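
A minimal sketch of that round trip follows, under a few stated assumptions: it writes to an in-memory byte array rather than a file, it uses Properties.store( ) (the later replacement for the save( ) call in the listing), and the class name NameRepositorySketch and the method roundTrip( ) are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical sketch of a name repository held in a Properties object.
class NameRepositorySketch {
  // Store each name as both key and value, write it out, and read it back.
  static Properties roundTrip(String... names) {
    Properties saved = new Properties();
    for (String n : names)
      saved.setProperty(n, n); // Same string for key and value
    try {
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      saved.store(out, "Names stored by NameRepositorySketch");
      Properties loaded = new Properties();
      loaded.load(new ByteArrayInputStream(out.toByteArray()));
      return loaded;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    Properties p = roundTrip("String", "Vector");
    System.out.println(p.containsKey("Vector"));
    System.out.println(p.getProperty("String"));
  }
}
```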

For the classes and identifiers that are discovered for the files in a particular directory, two MultiStringMaps are used: classMap and identMap. Also, when the program starts up it loads the standard class name repository into the Properties object called classes, and when a new class name is found in the local directory that is also added to classes as well as to classMap. This way, classMap can be used to step through all the classes in the local directory, and classes can be used to see if the current token is a class name (which indicates a definition of an object or method is beginning, so grab the next tokens, until a semicolon, and put them into identMap).

The default constructor for ClassScanner creates a list of file names (using the JavaFilter implementation of FilenameFilter, as described in Chapter 10). Then it calls scanListing( ) for each file name.
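
The file-listing step can be sketched on its own. The version below assumes a Java version that supports lambda expressions, which the listing predates; the class name JavaFileListSketch and the constant JAVA_ONLY are hypothetical, and the filter simply mirrors the test made by JavaFilter.

```java
import java.io.File;
import java.io.FilenameFilter;

// Hypothetical sketch of listing the .java files in a directory.
class JavaFileListSketch {
  // Accept only names ending in ".java", as JavaFilter does in the listing.
  static final FilenameFilter JAVA_ONLY =
    (dir, name) -> name.trim().endsWith(".java");

  // File.list( ) returns null when dir is not a readable directory,
  // so that case is mapped to an empty array here.
  static String[] javaFiles(File dir) {
    String[] names = dir.list(JAVA_ONLY);
    return names == null ? new String[0] : names;
  }

  public static void main(String[] args) {
    for (String name : javaFiles(new File(".")))
      System.out.println(name);
  }
}
```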

Inside scanListing( ) the source code file is opened and turned into a StreamTokenizer. In the documentation, passing true to slashStarComments( ) and slashSlashComments( ) is supposed to strip those comments out, but this seems to be a bit flawed (it doesn’t quite work in Java 1.0). Instead, those lines are commented out and the comments are extracted by another method. To do this, the ‘/’ must be captured as an ordinary character rather than letting the StreamTokenizer absorb it as part of a comment, and the ordinaryChar( ) method tells the StreamTokenizer to do this. This is also true for dots (‘.’), since we want to have the method calls pulled apart into individual identifiers. However, the underscore, which is ordinarily treated by StreamTokenizer as an individual character, should be left as part of identifiers since it appears in such static final values as TT_EOF, used in this very program. The wordChars( ) method takes a range of characters you want to add to those that are left inside a token that is being parsed as a word. Finally, when parsing for one-line comments or discarding a line we need to know when an end-of-line occurs, so by calling eolIsSignificant(true) the end-of-line will show up as a token rather than being absorbed by the StreamTokenizer.
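
To see the effect of these settings in isolation, here is a small sketch that applies the same four calls to a string instead of a file and collects only the word tokens. The class name TokenizerSetupSketch, the method words( ), and the sample input are hypothetical.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the tokenizer setup used in scanListing( ).
class TokenizerSetupSketch {
  // Return the word tokens found in the given source text.
  static List<String> words(String source) {
    StreamTokenizer in =
      new StreamTokenizer(new StringReader(source));
    in.ordinaryChar('/');      // Slash is a token, not a comment start
    in.ordinaryChar('.');      // Dot splits "a.b" into a, '.', b
    in.wordChars('_', '_');    // Underscore stays inside identifiers
    in.eolIsSignificant(true); // End of line is reported as TT_EOL
    List<String> result = new ArrayList<>();
    try {
      while (in.nextToken() != StreamTokenizer.TT_EOF)
        if (in.ttype == StreamTokenizer.TT_WORD)
          result.add(in.sval);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return result;
  }

  public static void main(String[] args) {
    // Prints [in, ttype, StreamTokenizer, TT_EOF]
    System.out.println(words("in.ttype == StreamTokenizer.TT_EOF"));
  }
}
```

Without the ordinaryChar('.') call the dot counts as part of a number or word, so in.ttype would come back as a single word; without wordChars('_', '_') the name TT_EOF would be split at the underscore.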

The rest of scanListing( ) reads and reacts to tokens until the end of the file, signified when nextToken( ) returns the final static value StreamTokenizer.TT_EOF.

If the token is a ‘/’ it is potentially a comment, so eatComments( ) is called to deal with it. The only other situation we’re interested in here is if it’s a word, of which there are some special cases.

If the word is class or interface then the next word token represents a class or interface name, and it is put into classes and classMap. If the word is import or package, then we don’t want the rest of the line. Anything else must be an identifier (which we’re interested in) or a keyword (which we’re not, but they’re all lowercase anyway so it won’t spoil things to put those in). These are added to identMap.

The discardLine( ) method is a simple tool that looks for the end of a line. Note that any time you get a new token, you must check for the end of the file.

The eatComments( ) method is called whenever a forward slash is encountered in the main parsing loop. However, that doesn’t necessarily mean a comment has been found, so the next token must be extracted to see if it’s another forward slash (in which case the line is discarded) or an asterisk (in which case tokens are thrown away until the closing asterisk and slash are found). But if it’s neither of those, it means the token you’ve just pulled out is needed back in the main parsing loop! Fortunately, the pushBack( ) method allows you to “push back” the current token onto the input stream so that when the main parsing loop calls nextToken( ) it will get the one you just pushed back.
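
The push-back step can be sketched on its own. To stay short, the version below handles only one-line comments and pushes back every other token that follows a slash, so block comments are not skipped here; the class name PushBackSketch, the method wordsOutsideLineComments( ), and the sample input are hypothetical.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of looking ahead one token and pushing it back.
class PushBackSketch {
  // Collect word tokens, skipping anything after "//" on a line.
  static List<String> wordsOutsideLineComments(String source) {
    StreamTokenizer in =
      new StreamTokenizer(new StringReader(source));
    in.ordinaryChar('/');
    in.eolIsSignificant(true);
    List<String> result = new ArrayList<>();
    try {
      while (in.nextToken() != StreamTokenizer.TT_EOF) {
        if (in.ttype == StreamTokenizer.TT_WORD) {
          result.add(in.sval);
        } else if (in.ttype == '/') {
          if (in.nextToken() == '/') {
            // One-line comment: throw away tokens to end of line
            while (in.nextToken() != StreamTokenizer.TT_EOF
                   && in.ttype != StreamTokenizer.TT_EOL)
              ;
          } else {
            // Not a comment: the main loop must see this token
            in.pushBack();
          }
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return result;
  }

  public static void main(String[] args) {
    // Prints [total, count, next]
    System.out.println(
      wordsOutsideLineComments("total / count // average\nnext"));
  }
}
```

Here the word count is read while checking for a second slash, pushed back, and then delivered again to the main loop, which is why it still appears in the result.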

For convenience, the classNames( ) method produces an array of all the names in the classes collection. This method is not used in the program but is helpful for debugging.

The next two methods are the ones in which the actual checking takes place. In checkClassNames( ), the class names are extracted from the classMap (which, remember, contains only the names in this directory, organized by file name so the file name can be printed along with the errant class name). This is accomplished by pulling each associated Vector and stepping through that, looking to see if the first character is lower case. If so, the appropriate error message is printed.

In checkIdentNames( ), a similar approach is taken: each identifier name is extracted from identMap. If the name is not in the classes list, it’s assumed to be an identifier or keyword. A special case is checked: if the identifier length is 3 or more and all the characters are uppercase, this identifier is ignored because it’s probably a static final value such as TT_EOF. Of course, this is not a perfect algorithm, but it assumes that you’ll eventually notice any all-uppercase identifiers that are out of place.
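
The rules applied by the two checking methods can be restated as small predicates. This is a sketch, not the code in the listing: the class name CapitalizationRulesSketch and its method names are hypothetical, a Set of known class names stands in for the classes object, and Locale.ROOT is an added assumption to keep the uppercase comparison independent of the default locale.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical restatement of the capitalization rules as predicates.
class CapitalizationRulesSketch {
  // A class name is flagged when its first character is lowercase.
  static boolean badClassName(String name) {
    return Character.isLowerCase(name.charAt(0));
  }

  // All-uppercase names of length 3 or more are assumed to be
  // static final values and are not reported.
  static boolean looksLikeConstant(String id) {
    return id.length() >= 3
      && id.equals(id.toUpperCase(Locale.ROOT));
  }

  // An identifier is flagged when it is not a known class name, does
  // not look like a constant, and starts with an uppercase character.
  static boolean badIdentifier(String id, Set<String> knownClasses) {
    if (knownClasses.contains(id) || looksLikeConstant(id))
      return false;
    return Character.isUpperCase(id.charAt(0));
  }

  public static void main(String[] args) {
    Set<String> known = new HashSet<>(Arrays.asList("String"));
    System.out.println(badClassName("scanner"));         // true
    System.out.println(badIdentifier("Count", known));   // true
    System.out.println(badIdentifier("String", known));  // false
    System.out.println(badIdentifier("TT_EOF", known));  // false
  }
}
```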

Instead of reporting every occurrence of an identifier that starts with an uppercase character, this method keeps track of which ones have already been reported in a Vector called reportSet. This treats the Vector as a “set” that tells you whether an item is already in the set. The item is produced by concatenating the file name and identifier. If the element isn’t in the set, it’s added and then the report is made.
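
The report-once idea might be sketched with a real set type, under the assumption that java.util.HashSet is available (the listing uses a Vector and indexOf( ) instead). The class name ReportOnceSketch and the method firstReport( ) are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: report each (file, identifier) pair only once.
class ReportOnceSketch {
  private final Set<String> reported = new HashSet<>();

  // Set.add( ) returns true only if the element was not already
  // present, so this is true the first time a pair is seen.
  boolean firstReport(String file, String id) {
    return reported.add(file + id);
  }

  public static void main(String[] args) {
    ReportOnceSketch r = new ReportOnceSketch();
    String[] ids = { "Count", "Count", "Total" };
    for (String id : ids)
      if (r.firstReport("A.java", id))
        System.out.println(
          "Ident capitalization error in: A.java, ident: " + id);
  }
}
```

As in the listing, the key is the plain concatenation of file name and identifier, which is simple but could in principle let two different pairs produce the same string.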

The rest of the listing consists of main( ), which busies itself by handling the command-line arguments and figuring out whether you’re building a repository of class names from the standard Java library or checking the validity of code you’ve written. In both cases it makes a ClassScanner object.

Whether you’re building a repository or using one, you must try to open the existing repository. By making a File object and testing for existence, you can decide whether to open the file and load( ) the Properties list classes inside ClassScanner. (The classes from the repository add to, rather than overwrite, the classes found by the ClassScanner constructor.) If you provide only one command-line argument it means that you want to perform a check of the class names and identifier names, but if you provide two arguments (the second being “-a”) you’re building a class name repository. In this case, an output file is opened and the method Properties.save( ) is used to write the list into a file, along with a string that provides header information for the file.
