Creating Your Own Search Engine for C/C++ Code Samples

Programmers starting a project that involves a new API or framework commonly search for code samples that illustrate how the API can be used. This usually means running keyword searches that describe the API and the desired functionality, then manually scanning the Web pages behind the returned links for code snippets of interest. While a well-written snippet can reveal useful implementation details and save time in the long run, finding relevant snippets can itself take a fair amount of time. Given a programmer's natural tendency to automate repetitive tasks, this scenario leads one to wonder whether topically relevant code snippets could be extracted automatically.

In answer to this question, this article describes a way of post-processing keyword search results with regular-expression-based pattern matching to extract C/C++ code automatically. The technique consists of building a set of regular expressions designed to recognize C/C++ functions and control structures and applying them to topically relevant results obtained from a search API. For example, if you were interested in code pertaining to bubble sorts, you could use the search API to obtain links about performing bubble sorts in C/C++, request the Web page behind each link, and then process those pages with the regular expressions to extract elements of C/C++ code. By writing the extracted code and the corresponding links to an output file, you get a rapid way of identifying and visually inspecting potentially useful code snippets and their accompanying documentation. While the methodology is demonstrated here in the Perl programming language, any language that supports HTTP requests and regular expressions could be used to similar effect.

Using the Bing Search API

This technique could be adapted to work with any search API, but for the purposes of this article the Bing API will be used. The main rationale for choosing Bing is that, of the two major search APIs (Bing and BOSS), Bing is currently still free to use and thus lets developers explore search APIs without a financial commitment. You interact with the Bing API by sending HTTP GET requests to the service and receiving the results in XML, JSON, or SOAP format. In this article we will make use of the XML-based results. Bing's XML interface is suitable for applications whose queries are not expected to exceed the maximum URI length (2048 characters); requests that exceed it return 414 errors. The JSON interface operates like the XML interface and has similar limitations, but returns results in a form that makes it ideal for AJAX applications. The SOAP interface is typically reserved for cases in which you suspect that the query lengths and parameter specifications required by your application would exceed the maximum URI length.

Working with the Bing API involves making an HTTP GET request that specifies parameters such as your application ID (obtainable from Bing Developer), your query, and the source from which you want results (e.g., the Web, image search, video search, etc.). Optional parameters can also be specified to control the number of results returned, the file types, and so on.

Now that we have a conceptual understanding of how the Bing API works, let's take a look at a Perl code sample that implements an API request (Listing 1).

Listing 1: Making a request of the Bing API

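use LWP;
my $request=LWP::UserAgent->new();
my $response;
# base URL for XML requests; it is not shown in this listing, so a placeholder is used
$bing="Your Base URL Here";
$appid="AppId=Your Key Here";
# example query terms, with spaces written as + and "+" written as %2B
$query="bubble+sort+C%2B%2B";
$queryurl=$bing . $appid . "&Query=" . $query . "&Sources=Web&Web.FileType=HTML&Web.Count=1";
$response=$request->get($queryurl);   # sends the get request to the Bing API
die unless $response->is_success;     # stops if the request failed
$results=$response->content;          # XML formatted results returned by the API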

The $bing variable is used to store the base URL for making an XML request, while the $appid variable is where the developer would insert the AppID associated with their project (note: after the = sign). These variables are then concatenated along with a specification of our query terms, source type, and the optional parameters of file type and results count, to form the complete get request. Results count (the number of results to be returned) can be specified as any number between 1 and 50. For purposes of this program we are limiting the file type to HTML to ensure that links to PDF files, Word documents, etc. are not returned, since in its current form our application would not be able to process these documents. An LWP get request can then be made of the Bing API to request and obtain the search results and store them in the $results variable.
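One detail the request above leaves implicit is that query terms containing spaces or characters such as + (as in "C++") need to be escaped before they are concatenated into the URL. The following is a minimal sketch of that step rather than the article's own code: the helper names escape_query_term and build_query_url are hypothetical, the base URL is a placeholder and not the real Bing endpoint, and the escaping assumes a plain ASCII query string. It also checks the 1 to 50 results-count range and the 2048-character URI limit mentioned earlier.

```perl
use strict;
use warnings;

# Percent-encode every character except the RFC 3986 unreserved set.
# Assumes an ASCII (byte) string.
sub escape_query_term {
    my ($text) = @_;
    $text =~ s/([^A-Za-z0-9\-._~])/sprintf("%%%02X", ord($1))/ge;
    return $text;
}

# Assemble the get request URL from its parts (hypothetical helper).
sub build_query_url {
    my ($base, $appid, $query, $count) = @_;
    die "Results count must be between 1 and 50\n"
        if $count < 1 || $count > 50;
    my $url = $base . "AppId=" . $appid
            . "&Query=" . escape_query_term($query)
            . "&Sources=Web&Web.FileType=HTML&Web.Count=" . $count;
    die "URL exceeds the 2048 character URI limit\n" if length($url) > 2048;
    return $url;
}

# Placeholder base URL and key; substitute the real values for your project.
my $url = build_query_url("http://example.invalid/xml.aspx?", "YourKeyHere",
                          "C++ bubble sort", 10);
print "$url\n";
```

The resulting string can be passed directly to the LWP get call in place of $queryurl.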

Now that we have an understanding of how documents can be programmatically requested using the Bing API, let's take a look at the formatting of the API results. To see them, you can either print out the $results variable or enter your get request into any browser. A one-result query for the term "perl" would yield results as follows (note: Listing 2 is somewhat truncated to save space):

Listing 2: A sample XML response from the Bing API

<SearchResponse Version="2.2">
    <web:Title>The Perl Programming Language -</web:Title>
    <web:Description>The Perl Programming Language at </web:Description>
    <web:CacheUrl> </web:CacheUrl>

For purposes of the C/C++ code search we will be performing, it is important to note that the information pertaining to each individual result is contained between a set of "web:WebResult" tags and the link to the site referred to in each result is found between a set of "web:Url" tags. Thus, since we want to download the Web page associated with each result to process it for C/C++ code, we could extract the links from the Bing search results as follows (Listing 3):

Listing 3: Using an XML parser to extract links from XML formatted Bing API results.

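use XML::LibXML;
my $parser=XML::LibXML->new;
my $domtree=$parser->parse_string($results);
my @records=$domtree->getElementsByTagName("web:WebResult");  # one node per search result
my $i=0;
foreach my $record (@records){
   $links[$i++]=$record->getChildrenByTagName("web:Url")->to_literal;  # link for this result
}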

The above code listing makes use of the parsing module XML::LibXML by taking the XML formatted results returned from the Bing API call ($results) and constructing a DOM tree of the results. The DOM tree is then traversed to identify all of the "web:WebResult" nodes and, in turn, each of their corresponding "web:Url" child nodes. These child nodes contain the links of interest, which are stored in the @links array.

Creating and Implementing Regular Expressions that Match C/C++ Code

Now that we have an idea of how the Bing search API can be used in conjunction with our search application, it is time for us to consider how the C/C++ code can be extracted from the Web results returned by Bing. The basis behind the extraction will be the creation of a regular expression capable of matching common C/C++ syntaxes. Regular expressions provide a means of defining textual patterns and matching those patterns against blocks of text. For those who are unfamiliar with regular expressions, a tutorial on their usage can be found at (Dev Shed: Parsing and Regular Expression Basics). It is this idea of patterning that enables us to create regular expressions that can match common C/C++ structures, since the syntax of such structures does follow a set pattern. If we consider a typical C/C++ function and a control structure some similarities become evident (Listing 4).

Listing 4: C/C++ Code Syntax Samples

int myfunc ( ){
   //code here
}
while ( ) {
   //code here
}

In both cases (Listing 4), it is evident that a limited vocabulary precedes a set of parentheses followed by a set of curly braces. Thus, if we establish a set of regular expressions capable of matching these patterns, we should be able to pick out most C/C++ code that is wrapped in a function or a control structure. It is important to remember that when attempting such complex pattern matching, there will ALWAYS be false positives and false negatives in your result sets. The key is to tune your regular expressions to obtain the balance between false positives and false negatives that is most in line with your goals. If you are concerned with not missing anything, you may choose fuzzier expressions that match more things but leave you with a higher number of false positives. If you want higher quality matches and care less about missing a few good ones, then it may be in your interest to make the patterns more explicit. The point is that, while the regular expressions used in the sample code work for many cases, some modification may be warranted for your particular use of them.
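To make this trade-off concrete, the sketch below applies the keyword pattern from Listing 5 to a few one-line inputs. The surrounding match for the parentheses and the opening brace, the helper name match_header, and the sample inputs are assumptions made for illustration and are not taken from the article's code.

```perl
use strict;
use warnings;

# Keyword pattern used by the article's extraction listing.
my $regex = '(((int|long|double|float|void)\s*?\w{1,25})|if|while|for)';

# Return the text that precedes an opening brace if it looks like a
# function or control-structure header, or undef otherwise.
sub match_header {
    my ($text) = @_;
    return $text =~ /($regex\s*\(.*?\)\s*)(?=\{)/s ? $1 : undef;
}

my @samples = (
    'int myfunc ( ){',        # function header: matched
    'while ( ) {',            # control structure: matched
    'char *name(void) {',     # type not in the keyword list: missed
    'elseif ($a) {',          # PHP, matched through the embedded "if"
);

foreach my $sample (@samples) {
    my $header = match_header($sample);
    printf "%-22s => %s\n", $sample,
        defined $header ? "matched '$header'" : "no match";
}
```

The third sample is a false negative that adding char to the keyword list would fix, while the fourth is a false positive that anchoring the keywords with \b would remove. Each change moves the balance in one direction or the other.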

Listing 5: Defining the Regular Expressions and Using Them to Extract C/C++ Code

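use Text::Balanced qw(extract_codeblock);
#delimiter used to distinguish code blocks for use with Text::Balanced
my $delim='{}';
#regex used to match keywords/patterns that precede code blocks
my $regex='(((int|long|double|float|void)\s*?\w{1,25})|if|while|for)';
open(OFile, '>', 'codesamples.html') or die "Cannot open output file: $!"; # file name is a placeholder
foreach $link(@links){
      $response=$request->get("$link");  # gets Web page 
      next unless $response->is_success; # skips pages that could not be retrieved
      $results=$response->content;
      while($results=~s/<script.*?>.*?<\/script>//gsi){}; # filters out Javascript
      pos($results)=0; # sets the string position back to the starting point
      #matches a keyword/pattern plus its parentheses, stopping just before the opening brace
      while($results=~/($regex\s*\(.*?\)\s*)(?=\{)/gs){
          $code=$1 . extract_codeblock($results,$delim);
          print OFile "<h3><a href=\"$link\">$link</a></h3> \n";
          print OFile "$code" . "\n" . "\n";
      }
}
close(OFile);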

The code in Listing 5 first loads the Text::Balanced module, which can extract matched sets of quotes, parentheses, and the like. Here we set the delimiter used by Text::Balanced to curly braces, since our C/C++ functions and control structures will be enclosed in them. Next we define a regular expression that matches the text that can precede a code block belonging to a function or a control structure and store it in the variable $regex. Once this initial setup is complete, the code processes each link by using LWP to request the associated Web page and storing it in the $results variable. The Web page is then stripped of JavaScript so that JavaScript content does not cause false positive matches. After the JavaScript is removed, the "pos" function is used to set the $results string position back to the starting point. $results is then processed with our regular expression in conjunction with Text::Balanced's extract_codeblock function to identify the C/C++ code blocks contained in the Web page text. These code blocks are then written to an output file through the file handle OFile.
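Because the loop in Listing 5 depends on fetching live pages, it can help to try the extraction step in isolation first. The following sketch runs the same idea against an in-memory string. The helper name extract_blocks and the sample text are hypothetical, and the sketch calls extract_codeblock in list context so that the input string is left unmodified, which differs slightly from Listing 5.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_codeblock);

my $regex = '(((int|long|double|float|void)\s*?\w{1,25})|if|while|for)';

# Return every "header { ... }" block found in $text (hypothetical helper).
sub extract_blocks {
    my ($text) = @_;
    my @blocks;
    pos($text) = 0;
    # Match a header and stop just before its opening brace.
    while ($text =~ /($regex\s*\(.*?\)\s*)(?=\{)/gs) {
        my $header = $1;
        # In list context the first return value is the balanced block,
        # and pos($text) is moved past it on success.
        my ($block) = extract_codeblock($text, '{}');
        push @blocks, $header . $block if defined $block;
    }
    return @blocks;
}

my $page = 'Sum: int add(int a, int b) { return a + b; } '
         . 'Loop: while (n > 0) { n = n - 1; } End';
print "$_\n\n" for extract_blocks($page);
```

Note that extract_codeblock was written with Perl syntax in mind, so C/C++ constructs such as preprocessor lines or unbalanced quotes inside a block may cause an extraction to fail or to end early.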

Some sample results of this application can be seen in Figure 1. While in some cases the greedy nature of Perl's regular expressions can return a bit more than the desired code, overall the application does a good job of extracting code blocks from Web pages, which allows for the rapid pinpointing of pages that may provide valuable instruction on how to implement a programming feature of interest. Moreover, while this article focused on the extraction of C/C++ code, the similar syntax of other programming languages, such as Java, could allow these techniques to be readily adapted to searching for code blocks in those languages.

Figure 1: Sample results from the code search application
