Full Text Search: The Key to Better Natural Language Queries for NoSQL in Node.js
By Chetan Jagatkishore Kothari and Shalini Nautiyal
In enterprise web applications, web servers are the gateway for all requests generated from multiple channels such as web browsers and mobile devices. Most web servers provide an elaborate logging mechanism and can be configured to capture valuable information about each web request in web logs.
Web logs contain request-specific information captured by a web server in a production environment. This information can be analyzed to provide business insights on multiple performance and capacity metrics, such as the number of hits, page views, response time, number of transactions, concurrent users, transaction details, and think time, for scenarios such as SLA monitoring, performance test strategy definition, and capacity planning. Analyzing log files also yields meaningful insights into website visitors' behavior and complete website usage statistics, which can help attract more visitors to the site and convert those visitors into satisfied customers.
To retrieve all this information from the web logs, developers need to design and implement a web Log Parser that can validate and parse these logs to derive valuable business insights. A Log Parser gives you a way to create a data processing pipeline by mixing and matching input formats and output formats as needed, using a query written in a dialect of the SQL language to create the plumbing between these two components.
This query-driven model makes LINQ a natural fit for parsing web log files. This article gives an overview of analyzing a log file using a LINQ-based Log Parser.
Log Parser Requirements for Log Analysis
Log Parser should be designed to address the following requirements:
- Log Parser should be designed to support standard web server log file formats, viz. IISW3C, W3C and NCSA.
- Log Parser should provide a mechanism to specify a custom format and parse log files according to that custom format.
- Log Parser should be designed to provide query access to text based log data for searching specific information in a particular set of data.
- Log Parser should provide the mechanism to format the results of the query in text based output.
- Log Parser should provide the mechanism to persist parsed data into a database or memory as per the requirement.
- Log Parser should be designed considering performance requirements involved in parsing historical log files.
Leveraging LINQ to Implement Log Parser
Dealing with text files to import, export or filter information is not new. However, with the introduction of LINQ in the .NET Framework 3.5, we can now use its power to deal with text files in a more structured manner. LINQ introduces a more declarative coding paradigm.
Although LINQ is best known for its querying capability with LINQ to SQL and LINQ to Objects, it is also a good choice for implementing a Log Parser. LINQ can be used efficiently to parse web logs generated by a web server. Because web logs are delimited, structured text files, LINQ makes it simple for developers to parse them and extract meaningful information about an application's behavior in production.
Technical Overview of LINQ
- LINQ adds standard patterns for querying and updating data in any type of data store, from SQL databases and text files to in-memory collections, XML documents and .NET objects.
- LINQ to SQL enables you to treat data in your applications as native objects in the programming language you are using, abstracting the complexity of relational data management and database connections.
- LINQ addresses O/R mapping issues by making query operations, like SQL statements, part of the programming language. Adding to its power, LINQ is extensible and can be used to query various data sources.
Log Parser Using LINQ Implementation
To perform web log analysis, there are a few steps you need to complete before running the core LINQ code for parsing logs. The following steps briefly explain how to parse a set of web server logs:
- Transfer all the log files generated by the web server in all the nodes of the deployed application to a centralized location.
- Read all the text files into memory or a database, depending on your implementation needs. The approach described in this article reads the files into memory.
- Once you have the file data, perform generic parsing using LINQ to fetch all the information you will need from the log files.
- With the generically parsed data, you can run various LINQ queries to fetch meaningful information that is useful to your customer.
A generic example of parsing raw log files received from a web server using LINQ works as follows:
1. Method GetAllLogFiles will read all files from the given log location.
2. Once files are available, ParseLogFiles will read each file line by line; these lines are then passed to GenericParser method, which performs the first level of parsing.
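The two steps above can be sketched as follows. This is a minimal illustration, not the authors' original listing: the method names come from the text, but the `LogEntry` type, the `*.log` file extension, and the choice of the W3C extended format (space-separated fields announced by a `#Fields:` directive) are assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical container for one parsed line: field name -> raw string value.
public class LogEntry
{
    public Dictionary<string, string> Fields = new Dictionary<string, string>();
}

public static class LogParser
{
    // Returns every *.log file under the given location (the extension is an assumption).
    public static IEnumerable<string> GetAllLogFiles(string logLocation)
    {
        return Directory.GetFiles(logLocation, "*.log", SearchOption.AllDirectories);
    }

    // Reads each file and passes its lines to GenericParser, flattening the results.
    public static List<LogEntry> ParseLogFiles(IEnumerable<string> files)
    {
        return files.SelectMany(f => GenericParser(File.ReadAllLines(f))).ToList();
    }

    // First-level parsing, assuming W3C extended format:
    // a "#Fields:" directive names the columns, other "#" lines are comments.
    public static IEnumerable<LogEntry> GenericParser(IEnumerable<string> lines)
    {
        string[] fieldNames = new string[0];
        foreach (string line in lines)
        {
            if (line.StartsWith("#Fields:", StringComparison.Ordinal))
            {
                fieldNames = line.Substring("#Fields:".Length)
                    .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
                continue;
            }
            if (line.Length == 0 || line.StartsWith("#", StringComparison.Ordinal))
                continue;

            string[] values = line.Split(' ');
            if (values.Length != fieldNames.Length)
                continue; // skip lines that do not match the declared fields

            var entry = new LogEntry();
            for (int i = 0; i < values.Length; i++)
                entry.Fields[fieldNames[i]] = values[i];
            yield return entry;
        }
    }
}
```

A caller would typically chain the two methods, for example `LogParser.ParseLogFiles(LogParser.GetAllLogFiles(path))`, where `path` is the centralized log location. Lines that do not match the declared field count are silently skipped here; a production parser would more likely report them.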
Once we have the raw data from the log files, we can run various queries to generate meaningful information. For example, to find the trend of hits received by a website (hits during a given day or any other duration), the following LINQ query can be used.
public class HitsAnalyzer
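A hits-per-day query of this kind might look like the sketch below. It is an illustration rather than the authors' original `HitsAnalyzer` listing: the `Hit` and `DailyHits` types, the method name `GetHitsPerDay`, and the half-open `[start, end)` analysis window are all assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical record for one request taken from the parsed log data.
public class Hit
{
    public DateTime Timestamp;
    public string Url;
}

// Hypothetical result row: a calendar day and the number of hits on it.
public class DailyHits
{
    public DateTime Day;
    public int Count;
}

public static class HitsAnalyzer
{
    // Groups hits inside [start, end) by calendar day, ordered by date.
    public static List<DailyHits> GetHitsPerDay(
        IEnumerable<Hit> hits, DateTime start, DateTime end)
    {
        var query =
            from h in hits
            where h.Timestamp >= start && h.Timestamp < end
            group h by h.Timestamp.Date into day
            orderby day.Key
            select new DailyHits { Day = day.Key, Count = day.Count() };

        return query.ToList();
    }
}
```

Grouping on `Timestamp.Date` gives a daily trend; grouping on a different key (for example the hour) would give a finer-grained trend over the same data.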
Another example can be finding out the top five URLs that received the maximum hits for a given analysis duration.
Below is a sample LINQ query for the same.
public class HitsAnalyzer
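A top-URLs query could be sketched as shown below. Again, this is an illustration and not the authors' original listing: the `Hit` type, the method name `GetTopUrls`, and the alphabetical tie-break are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical record for one request taken from the parsed log data.
public class Hit
{
    public DateTime Timestamp;
    public string Url;
}

public static class HitsAnalyzer
{
    // Returns the 'top' most requested URLs inside [start, end),
    // each paired with its hit count, highest count first.
    public static List<KeyValuePair<string, int>> GetTopUrls(
        IEnumerable<Hit> hits, DateTime start, DateTime end, int top)
    {
        return hits
            .Where(h => h.Timestamp >= start && h.Timestamp < end)
            .GroupBy(h => h.Url)
            .OrderByDescending(g => g.Count())
            .ThenBy(g => g.Key)              // deterministic order for ties
            .Take(top)
            .Select(g => new KeyValuePair<string, int>(g.Key, g.Count()))
            .ToList();
    }
}
```

Calling `GetTopUrls(hits, start, end, 5)` would give the top five URLs for the analysis duration. URLs are compared as exact strings here; whether `/Home.aspx` and `/home.aspx` should count as the same page depends on the web server and is left to the caller.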
While implementing a Log Parser using LINQ, we can also provide the flexibility of log format customization. Since different web servers support different log formats, an intelligent Log Parser must understand the format in use and derive the results of log analysis accordingly.
During log analysis, you may encounter two or more different log formats, or a change in the number of fields being logged in the web logs. A simple example is the presence of a cookie field in one log format and not in another; the split character or comment symbol may also differ between two formats.
Hence, to provide log format customization, certain activities must be performed before analyzing web server logs:
1) Identify the differences between log formats to define a customization strategy. Most of the time, variations in log formats concern field positions and field formats. For example, the date might be the first field in one log format while the IP address is the first field in another, or the date might be in yyyy-MM-dd format in one log format and in dd-MM-yyyy format in another.
2) Provide users a way to configure log format for the application.
3) At the time of analysis, identify the log format to be used based on the analysis duration.
4) If multiple log formats occur within a given analysis duration, the data would be parsed and analyzed per log format and would then need to be merged.
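One way to sketch these activities is to describe each log format as a small configuration object and let the parser consult it. The example below is an assumption-laden illustration: the `LogFormat`, `ParsedLine` and `ConfigurableParser` names, and the particular settings exposed (separator, comment prefix, field order, date format), are hypothetical and only mirror the kinds of differences mentioned above.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

// Hypothetical user-configurable description of one log format.
public class LogFormat
{
    public char Separator = ' ';
    public string CommentPrefix = "#";
    public string[] FieldNames;             // field order for this format
    public string DateField = "date";
    public string DateFormat = "yyyy-MM-dd";
}

// Hypothetical parsed line with a normalized date, so formats can be merged.
public class ParsedLine
{
    public DateTime Date;
    public Dictionary<string, string> Fields;
}

public static class ConfigurableParser
{
    // Parses lines according to the given format; lines with the wrong
    // number of fields are skipped, invalid dates throw FormatException.
    public static IEnumerable<ParsedLine> Parse(IEnumerable<string> lines, LogFormat format)
    {
        return
            from line in lines
            where line.Length > 0
                  && !line.StartsWith(format.CommentPrefix, StringComparison.Ordinal)
            let values = line.Split(format.Separator)
            where values.Length == format.FieldNames.Length
            let fields = format.FieldNames
                .Select((name, i) => new { name, value = values[i] })
                .ToDictionary(p => p.name, p => p.value)
            select new ParsedLine
            {
                Date = DateTime.ParseExact(fields[format.DateField],
                                           format.DateFormat,
                                           CultureInfo.InvariantCulture),
                Fields = fields
            };
    }

    // Merges data parsed under different formats into one date-ordered list.
    public static List<ParsedLine> Merge(params IEnumerable<ParsedLine>[] parsedSets)
    {
        return parsedSets.SelectMany(s => s).OrderBy(p => p.Date).ToList();
    }
}
```

In this sketch the caller decides which `LogFormat` applies to which files (activity 3), parses each set separately, and then calls `Merge` (activity 4). Because every `ParsedLine` carries a normalized `DateTime`, queries over the merged list do not need to know which format a line came from.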
The statistics and analysis information captured about a site's visitors (page views, activity statistics, accessed files, paths through the site, referring pages, search engines, browsers, operating systems, and more) can be presented in easy-to-read reports that include both text information (tables) and charts. The results of LINQ queries can be bound to standard .NET web controls for tables and charts.
This article covered an overview of analyzing log files generated by a web server, leveraging LINQ to derive meaningful business insights. This article will help architects and developers leverage LINQ in an effective way while building and designing website statistics or log analysis solutions.
1. LINQ Overview - http://www.devx.com/DevX/Article/34653/0/page/3
2. Log Analysis - http://en.wikipedia.org/wiki/Log_analysis
Chetan Kothari works as a Principal Architect at the Java Enterprise Center of Excellence at Infosys Technologies, a Global leader in IT & Business Consulting Services. Chetan has over 13 years’ experience with expertise in J2EE application framework development, defining, architecting and implementing large-scale, mission-critical, IT Solutions across a range of industries. He can be reached at email@example.com
Shalini Nautiyal works as a Technology Lead at Infosys Labs. She has over 6 years’ experience in enterprise application design and development. Shalini has expertise in WCF and Scalable Web Applications using Microsoft Technologies. She can be reached at Shalini_nautiyal@infosys.com