
Big Data Architecture Best Practices

The marketing departments of software vendors have done a good job of making Big Data go mainstream, whatever that means. The promise is that we can achieve anything if we make use of Big Data: business insight and beating our competition into submission. Yet there is no well-publicised, successful Big Data implementation. The question is: why not? Businesses have invested billions of dollars in this supposed silver bullet, with no return on investment to show for it. Who is to blame? After all, businesses do not have to publicise their internal processes or projects. I have a different view: the cause lies with the IT department. Most Big Data projects are driven by the technologists, not the business, and there is a real lack of understanding when it comes to aligning the architecture with the business vision for the future.

The Preliminary Phase

Big Data projects are no different from any other IT project. All projects grow out of business needs and requirements. This is not The Matrix; we cannot answer questions which have not been asked yet. Before any work begins, or any discussion about which technology to use, all stakeholders need to have an understanding of:

  • The organisational context
  • The key drivers and elements of the organisation
  • The requirements for architecture work
  • The architecture principles
  • The framework to be used
  • The relationships between management frameworks
  • The enterprise architecture maturity

In the majority of cases, Big Data projects involve knowing the current business technology landscape, in terms of current and future applications and services:

  • Strategies and business plans
  • Business principles, goals, and drivers
  • Major frameworks currently implemented in the business
  • Governance and legal frameworks
  • IT strategy
  • Pre-existing Architecture Framework, Organisational Model, and Architecture repository

The Big Data Continuum

Big Data projects are not, and should never be, executed in isolation. The simple fact that Big Data needs to feed from other systems means there should be an open channel of communication across teams. In order to have a successful architecture, I came up with five simple layers, or stacks, for a Big Data implementation. To the more technically inclined architect, these will seem obvious:

  • Data sources
  • Big Data ETL
  • Data Services API
  • Application
  • User Interface Services

Big Data Protocol Stack

Data Sources

Current and future applications will produce more and more data, which will need to be processed in order to gain any competitive advantage from it. Data comes in all sorts, but we can categorise it into two types:

  1. Structured data – usually stored following a predefined format, for example using known and proven database techniques. Not all structured data is stored in a database; many businesses use flat files such as Microsoft Excel spreadsheets or tab-delimited files for storing data.
  2. Unstructured data – businesses generate a great amount of unstructured data, such as emails, instant messaging, video conferencing, internet content, and flat files such as documents and images; the list is endless. We call the data “unstructured” because it does not follow a format that facilitates querying its content.

I spent a large part of my career working on Enterprise Search technology before the term “Big Data” was even coined. Understanding where the data is coming from, and in what shape, is valuable to a successful implementation of a Big Data ETL project. Before a single line of programming code is written, architects will have to try to normalise the data to a common format.
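As an illustration, here is a minimal sketch of what such normalisation might look like in Java. The CommonRecord class and its field names are made up purely for this example; the point is simply that both a structured row and an unstructured document end up in the same common shape before entering the ETL pipeline.

import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical common format: every source record becomes a flat map of named fields.
public class CommonRecord {

    private final Map<String, String> fields = new LinkedHashMap<String, String>();

    public void put(String name, String value) {
        fields.put(name, value);
    }

    // Structured source: a CSV line with a known column order.
    public static CommonRecord fromCsvLine(String line) {
        String[] columns = line.split(",");
        CommonRecord record = new CommonRecord();
        record.put("id", columns[0]);
        record.put("name", columns[1]);
        record.put("body", columns.length > 2 ? columns[2] : "");
        return record;
    }

    // Unstructured source: an e-mail where only a few fields can be recovered.
    public static CommonRecord fromEmail(String subject, String body) {
        CommonRecord record = new CommonRecord();
        record.put("id", String.valueOf(subject.hashCode()));
        record.put("name", subject);
        record.put("body", body);
        return record;
    }

    @Override
    public String toString() {
        return fields.toString();
    }

    public static void main(String[] args) {
        System.out.println(fromCsvLine("42,Acme Ltd,Invoice overdue"));
        System.out.println(fromEmail("Invoice overdue", "Please pay invoice 42 by Friday."));
    }
}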

Big Data ETL

This is the part that excites technologists, and especially the development teams. There are so many blogs and articles published every day about Big Data tools that it creates confusion among non-technical people. Everybody is excited about processing petabytes of data using the coolest kid on the block: Hadoop and its ecosystem. Before we get carried away, we first need to put some baselines in place:

  • Real-time processing
  • Batch processing

Big Data – Data Consolidation

The purpose of an Extract, Transform, Load (ETL) project, whether it uses Hadoop or not, is to consolidate the data into a single-view Master Data Management system for querying on demand. Hadoop and its ecosystem deal with the ETL aspect of Big Data, not the querying part. The tools used will depend heavily on the processing needs of the project, either real-time or batch; Hadoop, for instance, is a batch-processing framework for large volumes of data. Once the data has been processed, the Master Data Management (MDM) system can be stored in a data repository, whether NoSQL-based or an RDBMS; this will depend only on the querying requirements.
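To make the batch side concrete, below is a minimal sketch of a consolidation job written against the Hadoop MapReduce API. The input format (comma-separated lines whose first column is a customer identifier) and the output layout are assumptions made purely for illustration; a real MDM consolidation would involve far more matching and cleansing logic.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Batch consolidation sketch: group every record fragment by customer id
// and merge the fragments into a single line per customer.
public class ConsolidateRecords {

    public static class FragmentMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Assumed input: customerId,rest-of-record
            String[] parts = line.toString().split(",", 2);
            if (parts.length == 2) {
                context.write(new Text(parts[0]), new Text(parts[1]));
            }
        }
    }

    public static class MergeReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text customerId, Iterable<Text> fragments, Context context)
                throws IOException, InterruptedException {
            StringBuilder merged = new StringBuilder();
            for (Text fragment : fragments) {
                if (merged.length() > 0) {
                    merged.append('|');
                }
                merged.append(fragment.toString());
            }
            context.write(customerId, new Text(merged.toString()));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "consolidate-customer-records");
        job.setJarByClass(ConsolidateRecords.class);
        job.setMapperClass(FragmentMapper.class);
        job.setReducerClass(MergeReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}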

Data Services API

As most of the limelight goes to the ETL tools, a very important area is usually overlooked until later, almost as an afterthought. The MDM data will need to be stored in a repository so that the information can be retrieved when needed. In a true Service Oriented Architecture spirit, the data repository should expose interfaces to external third-party applications for data retrieval and manipulation. In the past, MDM systems were mostly built on an RDBMS, and retrieval and manipulation were carried out through the Structured Query Language. This does not have to change, but architects should be aware of other forms of database, such as the NoSQL types. The following questions should be asked when choosing a database solution:

  • Is there a standard query language?
  • How do we connect to the database: DB drivers or available web services?
  • Will the database scale as the data grows?
  • What security mechanisms are in place for protecting some or all of the data?

Other questions specific to the project should also be included in the checklist.
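To make the idea of exposing the repository through a service interface concrete, here is a minimal sketch using the JDK's built-in HttpServer. The endpoint, port and the in-memory "repository" are assumptions for the example only; in a real project this service would sit in front of the NoSQL or RDBMS store chosen above.

import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpHandler;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.util.HashMap;
import java.util.Map;

public class CustomerDataService {

    // Stand-in for the real MDM repository (NoSQL or RDBMS).
    private static final Map<String, String> REPOSITORY = new HashMap<String, String>();

    public static void main(String[] args) throws IOException {
        REPOSITORY.put("42", "{\"id\":\"42\",\"name\":\"Acme Ltd\"}");

        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/customers/", new HttpHandler() {
            public void handle(HttpExchange exchange) throws IOException {
                // The last part of the path is the customer id, e.g. /customers/42
                String path = exchange.getRequestURI().getPath();
                String id = path.substring(path.lastIndexOf('/') + 1);
                String json = REPOSITORY.get(id);
                byte[] body = (json != null ? json : "{}").getBytes("UTF-8");
                exchange.getResponseHeaders().set("Content-Type", "application/json");
                exchange.sendResponseHeaders(json != null ? 200 : 404, body.length);
                OutputStream out = exchange.getResponseBody();
                out.write(body);
                out.close();
            }
        });
        server.start();
        System.out.println("Data service listening on http://localhost:8080/customers/42");
    }
}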

Business Applications

So far, we have extracted the data, transformed it, and loaded it into a Master Data Management system. The normalised data is now exposed through web services (or DB drivers) to be used by third-party applications. Business applications are the reason we undertake Big Data projects in the first place. Some will argue that we should hire Data Scientists (?). According to many blogs, the Data Scientist's role is to understand the data, explore the data, prototype (new answers to questions not yet asked) and evaluate their findings. This is interesting, as it reminds me of the motion picture The Matrix, where the Architect knew the answers to Neo's questions before Neo had even asked them and decided which ones were relevant. Now, this is not how businesses are run. It would be extremely valuable if the data scientist could plant a new way of doing something subconsciously (Inception), but most of the time the questions will come from the business, to be answered by the Data Scientist or whoever knows the data. The business applications will be the answer to those questions.

User Interface Services

User interfaces make or break the project; a badly designed UI will hurt adoption regardless of the data behind it, while an intuitive design will increase adoption, and maybe then users will start questioning the quality of the data. Users will access the data in different ways: mobile, TV and web, for example. Users will usually focus on a certain aspect of the data and will therefore require the data to be presented in a customised way. Other users will want the data to be available through their current dashboards and to match their current look and feel. As always, security will also be a concern. Enterprise portals have been around for a long time and are usually used for data integration projects. Standards such as Web Services for Remote Portlets (WSRP) also make it possible for user interfaces to be served through web service calls.

Conclusion

This article shows the importance of architecting a Big Data project before embarking on it. The project needs to be in line with the business vision and based on a good understanding of the current and future technology landscape. The data needs to bring value to the business, and therefore the business needs to be involved from the outset. Understanding how the data will be used is key to success, and taking a service-oriented architecture approach will ensure that the data can serve many business needs.

Develop Your Own Google with Apache Lucene (Java Nutch Solr)

Apache Lucene is an open-source API that allows a Java developer (.NET libraries are also available) to write indexing and full-text search capable applications. I have been writing applications based on Lucene for the last three years, and some of them have been deployed at large corporations. I know there are other libraries available to developers who wish to write an indexing engine, but this blog will focus solely on Apache Lucene; I will not compare it to other APIs.

Lucene is a very mature API and can be found in the NetBeans IDE, Liferay and JackRabbit, among others. IBM has written a very good document about the Lucene architecture, so I will not delve into it here.

Lucene alone is pretty much useless, like any other API, so let's now introduce Nutch. Nutch is a web crawler built on top of Lucene to provide crawling capabilities. Nutch was designed to handle large amounts of data from the internet (HTTP). Thanks to its plugin architecture, it was later extended to crawl local network sources such as FTP, databases and Microsoft Windows shares (I am the author of the protocol-smb plugin and co-author of the index-extra plugin found on the Nutch site). We had extended Nutch and turned it into an Enterprise Search application, but most of the source code stayed locked behind closed doors (company politics). Anyway, Nutch has evolved, but it is still very complex in its inner workings. The initial Nutch was developed to process data in batches; there are ways to make it near real-time, but that is a topic for another day. So Nutch is good for crawling and indexing data, but it does not handle search directly. There is a web application available with Nutch, but it is quite poor, so let's now introduce Solr.

Solr is a powerful web-based search server built on top of Lucene. The application was developed by CNET Networks and donated to the Apache Foundation. I believe (I am not too sure, so I might need some references here) Solr was powering the search feature on their site, but it is definitely used internally by the company. In late 2009, Lucid Imagination received $7.5 million in funding to provide commercial services built around Solr (and possibly Lucene?). Here is a very good presentation about Solr. Solr is a very good indexing engine; the key phrase here is “indexing engine”. It does not have any support for crawling data, so it requires the developer to create applications that feed it the data to index. I do believe this is a good feature of the application, as it gives the ability to integrate with various systems, as long as they can post data over HTTP.
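As a sketch of what "posting data over HTTP" looks like in practice, the snippet below sends one document to Solr's XML update handler and then commits. The URL assumes a default local Solr install on port 8983, and the field names are examples; they must match your Solr schema.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class SolrPoster {

    // Assumes a default local Solr instance; adjust the URL for your installation.
    private static final String UPDATE_URL = "http://localhost:8983/solr/update";

    public static void main(String[] args) throws Exception {
        String doc = "<add><doc>"
                + "<field name=\"id\">doc-1</field>"
                + "<field name=\"title\">Hello from Java</field>"
                + "</doc></add>";
        post(doc);
        post("<commit/>"); // make the document visible to searches
    }

    private static void post(String xml) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(UPDATE_URL).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
        OutputStream out = conn.getOutputStream();
        out.write(xml.getBytes("UTF-8"));
        out.close();
        System.out.println("Solr responded with HTTP " + conn.getResponseCode());
        conn.disconnect();
    }
}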

Nutch is a good crawler, but it does not provide an enterprise-grade search interface to its data. Solr, on the other hand, is a powerful indexer with an enterprise-grade search interface, but it does not know how to gather data on its own. I am sure that by now it has become obvious how we can integrate the two.

We want Nutch to gather the data, bypass its indexing cycle and feed the data directly to Solr. Lucid Imagination has a good tutorial about this here.

After reading the tutorial from Lucid Imagination, you will notice that Nutch is run by executing bash scripts. This is something I strongly disagree with. If Nutch is based on Java (an OS-independent language), why do we need to execute UNIX/Linux shell scripts? Also, the fact that we need to install Cygwin on the MS Windows platform to be able to run it is a big negative for me. I wrote a simple Java application that launches Nutch and sends the crawled data to Solr for indexing, but as you can see in the source code, you still need a UNIX-like environment to run it successfully. You could write a platform-independent version by looking up the Nutch API and calling the methods directly.

Well, I hope this entry helps you understand how to use Nutch and Solr, built on top of Apache Lucene. If you need any clarification, leave a comment and I will try to answer ASAP, time permitting.

package com.etapix.nutchsolr;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileFilter;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.logging.Level;
import java.util.logging.Logger;

/**
 *
 * @author Armel Nene
 */
public class Indexer {

    public static void main(String args[]) {
        if (args.length < 3) {
            System.out.println("Usage:" +
                    "ncrawlName        -   This will be used to store crawler files in CrawlName directory" +
                    "nurlFolder        -   The path to the folder containing the URL to crawl" +
                    "nsolrUrl          -   The URL to the Solr server");
            return;
        }
        String crawlerName = args[0];
        String urlFolder = args[1];
        String solrUrl = args[2];
        String inject = "bash bin/nutch2.sh inject " + crawlerName + "/crawldb " + urlFolder;
        String generate = "bash bin/nutch2.sh generate " + crawlerName + "/crawldb " + crawlerName + "/segments -topN 10 -numFetchers 5";
        String export = crawlerName + "/segments/";

        String invertLinks = "bash bin/nutch2.sh invertlinks " + crawlerName + "/linkdb -dir " + crawlerName + "/segments";
        String indexSolr = "bash bin/nutch2.sh solrindex " + solrUrl + " " + crawlerName + "/crawldb " + crawlerName + "/linkdb " + crawlerName + "/segments/*";
        try {
            System.out.println("Injecting URLs in crawldb");
//            int state = 0;
            InputStream in = Runtime.getRuntime().exec(inject).getInputStream();
            System.out.println(convertStreamToString(in));


//            state = Runtime.getRuntime().exec(inject).waitFor();
//            System.out.println("process completed: " + state);

            for (int i = 0; i < 3; i++) {
                System.out.println("Generating segments");
                in = Runtime.getRuntime().exec(generate).getInputStream();
                System.out.println(convertStreamToString(in));

                System.out.println("Setting environment variable $SEGMENT");
//            String segs = convertStreamToString(Runtime.getRuntime().exec("ls -tr " + crawlerName + "/segments|tail -1").getInputStream());

                String segments = export + lastFileModified(export).getName();
                System.out.println("$segments: " + segments);
//            in = Runtime.getRuntime().exec(export + segs).getInputStream();
//            System.out.println(convertStreamToString(in)); // leftover from the commented-out export step; the stream above was already consumed

                String fetch = "bash bin/nutch2.sh fetch " + segments + " -noParsing";
                String parse = "bash bin/nutch2.sh parse " + segments;
                String update = "bash bin/nutch2.sh updatedb " + crawlerName + "/crawldb " + segments + " -filter -normalize";

                System.out.println("fetch segments");
                in = Runtime.getRuntime().exec(fetch).getInputStream();
                System.out.println(convertStreamToString(in));

                System.out.println("Parse segments");
                in = Runtime.getRuntime().exec(parse).getInputStream();
                System.out.println(convertStreamToString(in));

                System.out.println("Update crawldb");
                in = Runtime.getRuntime().exec(update).getInputStream();
                System.out.println(convertStreamToString(in));
            }
            System.out.println("Inverting links");
            in = Runtime.getRuntime().exec(invertLinks).getInputStream();
            System.out.println(convertStreamToString(in));

            System.out.println("Indexing contents to Solr " + solrUrl);
            in = Runtime.getRuntime().exec(indexSolr).getInputStream();
            System.out.println(convertStreamToString(in));

        } catch (Exception ex) {
            Logger.getLogger(Indexer.class.getName()).log(Level.SEVERE, null, ex);
        }
    }

    public static String convertStreamToString(InputStream is) {
        /*
         * To convert the InputStream to a String we use the BufferedReader.readLine()
         * method. We iterate until readLine() returns null, which means there is
         * no more data to read. Each line is appended to a StringBuilder and
         * returned as a String.
         */
        BufferedReader reader = new BufferedReader(new InputStreamReader(is));
        StringBuilder sb = new StringBuilder();

        String line = null;
        System.out.println("Now converting inputstream to text");
        try {
            while ((line = reader.readLine()) != null) {
                sb.append(line + "n");
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                is.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
        System.out.println("Finish converting to text");
        return sb.toString();
    }

    public static File lastFileModified(String dir) {
        File fl = new File(dir);
        File[] files = fl.listFiles(new FileFilter() {

            public boolean accept(File file) {
                return file.isDirectory();
            }
        });
        long lastMod = Long.MIN_VALUE;
        File choice = null;
        for (File file : files) {
            if (file.lastModified() > lastMod) {
                choice = file;
                lastMod = file.lastModified();
            }
        }
        return choice;
    }
}
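
For completeness, here is a rough sketch of the platform-independent approach mentioned above: calling the Nutch classes directly instead of shelling out to bin/nutch2.sh. It assumes a Nutch 1.x release where Injector, Generator, Fetcher, ParseSegment, CrawlDb, LinkDb and SolrIndexer implement Hadoop's Tool interface; exact class names and arguments vary between Nutch versions, so treat this as a starting point rather than working code for every release.

package com.etapix.nutchsolr;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
import org.apache.nutch.crawl.CrawlDb;
import org.apache.nutch.crawl.Generator;
import org.apache.nutch.crawl.Injector;
import org.apache.nutch.crawl.LinkDb;
import org.apache.nutch.fetcher.Fetcher;
import org.apache.nutch.indexer.solr.SolrIndexer;
import org.apache.nutch.parse.ParseSegment;
import org.apache.nutch.util.NutchConfiguration;

/**
 * Sketch of driving a single Nutch 1.x crawl cycle through the Java API
 * instead of shelling out to bin/nutch2.sh.
 */
public class ApiIndexer {

    public static void main(String[] args) throws Exception {
        if (args.length < 3) {
            System.out.println("Usage: crawlName urlFolder solrUrl");
            return;
        }
        String crawlDb = args[0] + "/crawldb";
        String linkDb = args[0] + "/linkdb";
        String segmentsDir = args[0] + "/segments";
        Configuration conf = NutchConfiguration.create();

        // Each Nutch step implements Hadoop's Tool interface, so ToolRunner can
        // drive it with the same arguments the shell script would pass.
        ToolRunner.run(conf, new Injector(), new String[]{crawlDb, args[1]});
        ToolRunner.run(conf, new Generator(), new String[]{crawlDb, segmentsDir, "-topN", "10"});

        // Reuse the helper from Indexer above to locate the newly generated segment.
        String segment = segmentsDir + "/" + Indexer.lastFileModified(segmentsDir).getName();

        ToolRunner.run(conf, new Fetcher(), new String[]{segment, "-noParsing"});
        ToolRunner.run(conf, new ParseSegment(), new String[]{segment});
        ToolRunner.run(conf, new CrawlDb(), new String[]{crawlDb, segment});
        ToolRunner.run(conf, new LinkDb(), new String[]{linkDb, "-dir", segmentsDir});
        ToolRunner.run(conf, new SolrIndexer(), new String[]{args[2], crawlDb, linkDb, segment});
    }
}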