Friday, 29 April 2016

JWT Authentication With AngularJS - Video and Tutorial






by Sarang Nagmote



Category - Website Development
More Information & Updates Available at: http://insightanalytics.co.in




Lately I’ve been on the road, giving talks about web application security. JSON Web Tokens (JWTs) are the new hotness, and I’ve been trying to demystify them and explain how they can be used securely. In the latest iteration of this talk, I give some love to Angular and explain how I’ve solved authentication issues in that framework.
However, the guidelines in this talk are applicable to any front-end framework.
Below you will find a video recording of this talk, and a textual write-up of the content. I hope you will find it useful!

Status Quo: Session Identifiers

We’re familiar with the traditional way of doing authentication: a user presents a username and password. In return, we create a session ID for them and we store this ID in a cookie. This ID is a pointer to the user in the database.
This seems pretty straightforward, so why are JWTs trying to replace this scheme? Session identifiers have these general limitations:
  • They’re opaque, and contain no meaning themselves. They’re just pointers.
  • As such, they can be database heavy: you have to look up user information and permission information on every request.
  • If the cookies are not properly secured, it is easy to steal (hijack) the user’s session.
JWTs can give you some options regarding the database performance issues, but they are not more secure by default. The major attack vector for session IDs is cookie hijacking. We’re going to spend a lot of time on this issue, because JWTs have the same vulnerability.

Cookies, The Right Way ®

JWTs are not more secure by default. Just like session IDs, you need to store them in the browser. The browser is a hostile environment, but cookies are actually the most secure location, if used properly!
To use cookies securely, you need to do the following:
  • Only transmit cookies over HTTPS/TLS connections. Set the Secure flag on cookies that you send to the browser, so that the browser never sends the cookie over non-secure connections. This is important, because a lot of systems have HTTP redirects in them and they don’t always redirect to the HTTPS version of the URL. This will leak the cookie. Stop this by preventing it at the browser level with the Secure flag.
  • Protect yourself against Cross-Site Scripting attacks (XSS). A well-crafted XSS attack can hijack the user’s cookies. The easiest way to prevent XSS-based cookie hijacking is to set the HttpOnly flag on the cookies that you send to the browser. This will prevent those cookies from being read by the JavaScript environment, making it impossible for an XSS attack to read the cookie values. You should also implement proper content escaping to prevent all forms of XSS attacks. (A short sketch of setting both flags follows this list.)
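For illustration, here is a minimal sketch of setting both flags from a Node.js/Express server. The route, cookie name, and token value are assumptions for the example, not something prescribed by this article:

var express = require('express');
var app = express();

app.post('/login', function (req, res) {
  // ...authenticate the user here, then create a session ID or token...
  var token = 'session-id-or-jwt-produced-by-your-auth-logic'; // placeholder value

  res.cookie('access_token', token, {
    secure: true,   // only ever sent over HTTPS/TLS
    httpOnly: true  // invisible to the JavaScript environment (blunts XSS cookie theft)
  });
  res.status(200).end();
});

app.listen(3000);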

Protect Yourself from Cross-Site-Request-Forgery

Compromised websites can make arbitrary GET requests to your web application, and the browser will send along the cookies for your domain. Your server should not implicitly trust a request merely because it has session cookies.
You should implement Double Submit Cookies by setting an xsrf-token cookie on login. All AJAX requests from your front-end application should append the value of this cookie as the X-XSRF-Token header. This will trigger the Same-Origin Policy of the browser and deny cross-domain requests.
As such, your server should reject any request that does not see a match between the supplied X-XSRF-Token header and the xsrf-token cookie. The value of the cookie should be a highly random, un-guessable string.
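AngularJS helps with the client half of this automatically: if $http finds a cookie named XSRF-TOKEN, it sends its value back as the X-XSRF-TOKEN header on same-domain requests. The server half is a simple comparison; here is a rough Express-style sketch (the middleware name and cookie name are assumptions):

function verifyXsrf(req, res, next) {
  // assumes the cookie-parser middleware is installed
  var cookieValue = req.cookies['XSRF-TOKEN'];   // set at login as a random, un-guessable string
  var headerValue = req.headers['x-xsrf-token']; // echoed back by the front-end on each AJAX call

  if (!cookieValue || cookieValue !== headerValue) {
    return res.status(403).end(); // no match: treat it as a forged request
  }
  next();
}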

Introducing JSON Web Tokens (JWTs)!

Whoo, new stuff! JWTs are a useful addition to your architecture. As we talk about JWTs, the following terms are useful to define:
  • Authentication is proving who you are.
  • Authorization is being granted access to resources.
  • Tokens are used to persist authentication and get authorization.
  • JWT is a token format.

What’s in a JWT?

In the wild they look like just another ugly string:
eyJ0eXAiOiJKV1QiLA0KICJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJqb2UiLA0KICJleHAiOjEzMDA4MTkzODAsDQogImh0dHA6Ly9leGFtcGxlLmNvbS9pc19yb290Ijp0cnVlfQ.dBjftJeZ4CVPmB92K27uhbUJU1p1r_wW1gFWFOEjXk
But they do have a three-part structure, with the parts separated by periods. Each part is a Base64-URL encoded string:
eyJ0eXAiOiJKV1QiLA0KICJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJqb2UiLA0KICJleHAiOjEzMDA4MTkzODAsDQogImh0dHA6Ly9leGFtcGxlLmNvbS9pc19yb290Ijp0cnVlfQ.dBjftJeZ4CVPmB92K27uhbUJU1p1r_wW1gFWFOEjXk
Base64-decode the parts to see the contents:

Header:

{ "typ":"JWT", "alg":"HS256"}

Claims Body:

{ "iss”:”http://trustyapp.com/”, "exp": 1300819380, “sub”: ”users/8983462”, “scope”: “self api/buy”}

Cryptographic Signature:

tß´—™à%O˜v+nî…SZu¯µ€U…8H× (raw bytes; the signature is binary data, so it is not human-readable once decoded)

The Claims Body

The claims body is the best part! It asserts:
  • Who issued the token (iss).
  • When it expires (exp).
  • Who it represents (sub).
  • What they can do (scope).
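If you want to peek inside a token yourself, you can split the compact string on the dots and Base64-URL-decode the first two parts. A quick Node.js sketch (for inspection only; this does not verify anything):

function decodePart(part) {
  // Base64-URL uses '-' and '_' instead of '+' and '/'
  var base64 = part.replace(/-/g, '+').replace(/_/g, '/');
  return JSON.parse(Buffer.from(base64, 'base64').toString('utf8'));
}

var token = 'eyJ0eXAiOiJKV1QiLA0KICJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJqb2UiLA0KICJleHAiOjEzMDA4MTkzODAsDQogImh0dHA6Ly9leGFtcGxlLmNvbS9pc19yb290Ijp0cnVlfQ.dBjftJeZ4CVPmB92K27uhbUJU1p1r_wW1gFWFOEjXk';
var parts = token.split('.');

console.log(decodePart(parts[0])); // the header
console.log(decodePart(parts[1])); // the claims body
// parts[2] is the signature; never trust the claims until it has been verified.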

Issuing JWTs

Who creates JWTs? You do! Actually, your server does. The following happens:
  • User has to present credentials to get a token (password, api keys).
  • Token payload is created, compacted and signed by a private key on your server.
  • The client stores the tokens, and uses them to authenticate requests.
For creating JWTs in Node.js, I have published the nJwt library.
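As a quick illustration, here is roughly what issuing and verifying a token with nJwt looks like. This is a sketch based on the library’s documented usage; the signing key handling and claim values are placeholders:

var nJwt = require('njwt');

var signingKey = 'a-long-random-secret-kept-only-on-the-server'; // e.g. generated once and stored safely

var claims = {
  iss: 'http://trustyapp.com/',
  sub: 'users/8983462',
  scope: 'self api/buy'
};

var jwt = nJwt.create(claims, signingKey);                 // HS256 by default
jwt.setExpiration(new Date().getTime() + 60 * 60 * 1000);  // one hour from now
var compactToken = jwt.compact();                          // the string you hand to the client

// Later, when a request comes in with a token:
nJwt.verify(compactToken, signingKey, function (err, verifiedJwt) {
  if (err) {
    // bad signature or expired: reject the request
  } else {
    // verifiedJwt.body holds the claims
  }
});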

Verifying JWTs

Now that the client has a JWT, it can be used for authentication. When the client needs to access a protected endpoint, it will supply the token. Your server then needs to check the signature and expiration time of the token. Because this doesn’t require any database lookups, you now have stateless authentication… !!! Whoo!… ?
Yes, this saves trips to your database, and this is really exciting from a performance standpoint. But what if I want to revoke the token immediately, so that it can’t be used anymore, even before the expiration time? Read on to see how the access and refresh token scheme can help.

JWT + Access Tokens and Refresh Tokens = OAuth2?

Just to be clear: there is not a direct relationship between OAuth2 and JWT. OAuth2 is an authorization framework that prescribes the need for tokens. It leaves the token format undefined, but most people are using JWT.
Conversely, using JWTs does not require a full-blown OAuth2 implementation.
In other words, like “all great artists”, we’re going to steal a good part from the OAuth2 spec: the access token and refresh token paradigm.
The scheme works like this:
  • On login, the client is given an access token and refresh token.
  • The access token expires before refresh token.
  • New access tokens are obtained with the refresh token.
  • Access tokens are trusted by signature and expiration (stateless).
  • Refresh tokens are checked for revocation (requires a database of issued refresh tokens).
In other words: the scheme gives you time-based control over this trade-off: stateless trust vs. database lookup.
Some examples help to clarify the point:
  • Super-Secure Banking Application. If you set the access token expiration to 1 minute, and the refresh token to 30 minutes: the user will be refreshing new access tokens every minute (giving an attacker less than 1 minute to use a hijacked token) and the session will be force-terminated after 30 minutes.
  • Not-So-Sensitive Social/Mobile/Toy Application. In this situation you don’t expose any personally identifiable information in your application, and you want to use as few server-side resources as possible. You set the access token expiration to 1 hour (or longer), and the refresh token expiration to 4 years (the lifetime of a smartphone, if you’re frugal).

Storing & Transmitting JWTs (in the Browser)

As we’ve seen, JWTs provide some cool features for our server architecture. But the client still needs to store these tokens in a secure location. For the browser, this means we have a struggle ahead of us.

The Trade-Offs and Concerns to Be Aware Of:

  • Local Storage is not secure. It has edge cases with the Same Origin Policy,and it’s vulnerable to XSS attacks.
  • Cookies ARE secure, with HttpOnly, Secure flags, and CSRF prevention.
  • Using the Authorization header to transmit the token is fun but not necessary.
  • Cross-domain requests are always hell.

My Recommended Trade-Offs:

  • Store the tokens in cookies with the HttpOnly and Secure flags, and CSRF protection. CSRF protection is easy to get right; XSS protection is easy to get wrong.
  • Don’t use the Authorization header to send the token to the server, as the cookies handle the transmission for you, automatically.
  • Avoid cross-domain architectures if possible, to prevent the headache of implementing CORS responses on your server.
With this proposed cookie-based storage scheme for the tokens, your server authentication flow will look like this (a rough middleware sketch follows the list):
  • Is there an access token cookie?
    • No? Reject the request.
    • Yes?
      • Was it signed by me, and not expired?
        • Yes? Allow the request.
        • No? Try to get a new access token, using the refresh token.
          • Did that work?
            • Yes? Allow the request, send new access token on response as cookie.
            • No? Reject the request, delete refresh token cookie.
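Here is a sketch of that flow as Express-style middleware. Everything here is an assumption layered on the scheme above: the cookie names, and the verifyAccessToken / issueAccessTokenFromRefreshToken helpers, are hypothetical stand-ins for your own JWT code.

function authenticate(req, res, next) {
  var accessToken = req.cookies['access_token'];
  var refreshToken = req.cookies['refresh_token'];

  if (!accessToken) {
    return res.status(401).end(); // no access token cookie: reject the request
  }

  verifyAccessToken(accessToken, function (err) {
    if (!err) {
      return next(); // signed by me and not expired: allow the request
    }

    // Access token invalid or expired: try to get a new one with the refresh token.
    issueAccessTokenFromRefreshToken(refreshToken, function (refreshErr, newAccessToken) {
      if (refreshErr) {
        res.clearCookie('refresh_token'); // revoked or expired: force a fresh login
        return res.status(401).end();     // reject the request
      }
      // Allow the request and send the new access token on the response as a cookie.
      res.cookie('access_token', newAccessToken, { secure: true, httpOnly: true });
      next();
    });
  });
}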

AngularJS Patterns for Authentication

Because we are using cookies to store and transmit our tokens, we can focus on authentication and authorization at a higher level. In this section we’ll cover the main stories you need to implement, and some suggestions of how to do this in an Angular (1.x) way.

How Do I Know If the User Is Logged In?

Because our cookies are hidden from the JavaScript environment via the HttpOnly flag, we can’t use the existence of that cookie to know if we are logged in or not. We need to make a request of our server, to an endpoint that requires authentication, to know if we are logged in or not.
As such, you should implement an authenticated /me route which will return the user information that your Angular application requires. If they are not logged in, the endpoint should return 401 Unauthorized.
Requesting that endpoint should be the very first thing that your application does. You can then share the result of that operation a few ways:
  • With a Promise. Write an $auth service, and have a method $auth.getUser(). This is just a simple wrapper around the $http call to the /me endpoint. This should return a promise which returns the cached result of requesting the /me endpoint. A 401 response should cause the promise to be rejected.
  • Maintain a user object on root scope. Create a property $rootScope.user, and have your $auth service maintain it like so:
    • null means we are resolving user state by requesting /me.
    • false means we saw a 401, the user is not logged in.
    • {} means the user object was assigned from a logged-in response from the /me endpoint.
  • Emit an event. Emit an $authenticated event when the /me endpoint returns a logged-in response, and emit the user data with this event.
Which of these options you implement is up to you, but I like to implement all of them. The promise is my favorite, because it allows you to add “login required” configuration to your router states, using the resolve features of ngRoute or uiRouter. For example:
angular.module('myapp')
  .config(function($stateProvider) {
    $stateProvider
      .state('home', {
        url: '/',
        templateUrl: 'views/home.html',
        resolve: {
          user: function($auth) {
            return $auth.getUser();
          }
        }
      });
  });
If the user promise is rejected, you can catch the $stateChangeError and redirect the user to the login page.
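For completeness, here is a minimal sketch of what such an $auth service might look like. The module name and caching strategy are assumptions; only the /me endpoint, the $rootScope.user states, and the $authenticated event come from the text above.

angular.module('myapp')
  .service('$auth', function ($http, $q, $rootScope) {
    var cachedUser = null;

    this.getUser = function () {
      if (cachedUser) {
        return $q.when(cachedUser); // cached result of a previous /me call
      }
      $rootScope.user = null; // still resolving user state
      return $http.get('/me').then(function (response) {
        cachedUser = response.data;
        $rootScope.user = cachedUser;
        $rootScope.$emit('$authenticated', cachedUser);
        return cachedUser;
      }, function (response) {
        $rootScope.user = false; // we saw a 401: the user is not logged in
        return $q.reject(response);
      });
    };
  });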

How Do I Know If the User Can Access a View?

Because we aren’t storing any information in JavaScript-accessible cookies or local storage, we have to rely on the user data that comes back from the /me route. But with promises, this is very easy to achieve. Simply chain a promise off $auth.getUser() and make an assertion about the user data:
$stateProvider
  .state('home', {
    url: '/admin',
    templateUrl: 'views/admin-console.html',
    resolve: {
      user: function($auth) {
        return $auth.getUser()
          .then(function(user) {
            return user.isAdmin === true;
          });
      }
    }
  });

How Do I Know When Access Has Been Revoked?

Again, we don’t know when the cookies expire because we can’t touch them :)
As such, you have to rely on an API request to your server. Using an $http interceptor, if you see a 401 response from any endpoint other than the /me route, emit a $unauthenticated event. Subscribe to this event, and redirect the user to the login view.
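A minimal interceptor along those lines might look like this (the interceptor and module names are assumptions):

angular.module('myapp')
  .factory('unauthenticatedInterceptor', function ($q, $rootScope) {
    return {
      responseError: function (rejection) {
        if (rejection.status === 401 && rejection.config.url !== '/me') {
          $rootScope.$emit('$unauthenticated'); // some other endpoint says we are no longer logged in
        }
        return $q.reject(rejection);
      }
    };
  })
  .config(function ($httpProvider) {
    $httpProvider.interceptors.push('unauthenticatedInterceptor');
  });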

Recap

I hope this information has been useful! Go forth and secure your cookies :) As a quick recap:
  • JWTs help with authentication and authorization architecture.
  • They are NOT a “security” add-on. They don’t give you more security by default.
  • They’re more magical than an opaque session ID.
  • Store JWTs securely!
If you’re interested in learning more about JWTs, check out my article on Token-Based Authentication for Single Page Apps and Tom Abbott’s article on how to Use JWT The Right Way for a great overview.

How To Get Started with Android Development






by Sarang Nagmote



Category - Mobile Apps Development
More Information & Updates Available at: http://insightanalytics.co.in





Learning how to build a mobile application is a good project to improve your programming skills while learning to work in a different environment than the desktop or a web browser. You can get started without worrying about a large stack, making it easy for a beginner to pick it up and start playing with quickly.
Building applications with the Android SDK is self-contained if you stick with the standard libraries. You only need to download the package from Google containing all the tools and you’re ready to go. If you know object-oriented programming and how layouts are done for the web, many patterns and practices will feel familiar to you. The barrier to entry is low: all the tools are free, so you only need an Android device and your computer to get started.
Also, the open philosophy of Android means that you can do as you wish with your applications and your device. As long as you have an installer file (.apk), you can distribute your application to any device. It’s easy to send a copy of your application to your friends so they can test it out. This is great if you have a small project that you only want to deploy on a few machines, such as a kiosk or an art project. Once you have more experience, there are many open source libraries that improve on what is available in the SDK, and the open source community is active and welcoming.
The following will teach you how to get up and running with the samples included with the Android SDK.

Setting up an Android development environment

You don’t need much to get started with Android. Any decent PC, Mac or Linux box will do the job. All the tools are free, and you can install them as a single package from the Google Android Developers site at https://developer.android.com/sdk/.
There are two main tools you need to know about: the Android SDK Manager and the Android Studio IDE.
The Android SDK Manager is used to download the libraries, tools, system images and code samples for the platform (version) of the Android OS you want to develop for. By default, the package contains the latest version of the SDK Platform (6.0/API level 23 as I’m writing this). When a new version comes out or a new developer kit is available, you must download it using the SDK Manager.
The Android Studio IDE is where you’re going to spend most of your time. It is based on IntelliJ and includes a code editor, a layout editor and all the tools you need to compile your application and debug it on an emulator or on your Android device. Eclipse was previously available as an IDE so you’re going to see some references to it online, but Android Studio is now the official IDE for Android and everybody is now using it.

Installing the drivers for your tablet

By default, when you plug in an Android device in your computer, you’ll see the content of the external storage like you would see the content of a USB key. It’s enough to upload data, but if you want to debug an application on your device, you must install the drivers. You could develop applications using the emulator, but it’s a lot slower and it’s going to be hard to see if the touch interactions work as you intended.
If you have a Nexus device, the drivers are available from the Android SDK Manager. For other manufacturers like Samsung or ASUS, you can find the driver on their website. It’s not always clear what you should download, since the driver is often packaged with other software such as synchronization tools.
To be able to attach a debugger, you must also enable the debugging mode on your device by navigating to the About option in the Settings menu. Tap the Build number entry in the About screen seven times and the Developer Options menu will appear, allowing you to enable the debugging mode.

Running the sample projects from the Android SDK

The language used with the Android SDK is Java, but Android has its own virtual machine, Dalvik, that uses a subset of the standard Java classes. Since it’s such a small subset of Java, you don’t need experience in Java to get started. If you have a good basis in any object-oriented language, you should be able to pick it up pretty fast. The rest of the files, such as language files and layouts, are in XML.
To get started quickly, I’m going to show you how to run one of the sample projects. Those samples are from Google and they are a good starting point to learn what you can do in an Android application and how to do it. Google regularly adds new samples as new APIs becomes available, so check them before trying to do something.
When you start Android Studio for the first time, you’ll see the following screen. To get started running an application right now, just select Import an Android code sample to create a project.
[Screenshot: the Android Studio welcome screen]
In the following screen, choose the Borderless button example and click Next to create the project:
Once the project is loaded, select Debug… from the Run menu to launch the application in debugging mode on your device. A Device Chooser window will pop up, allowing you to select your device (if the driver has been properly installed) or to launch the emulator. Press OK and the sample will run on your device.
You can now play with the sample and add breakpoints in the source code (located in the Application/src folder) to see how it behaves. For example, you can put a breakpoint in the onCreate method of the MainActivity.java file, and try to understand how it behaves when you rotate your device.

How to Write Better QA Tests






by Sarang Nagmote



Category - Developer
More Information & Updates Available at: http://insightanalytics.co.in




Today, we’re sharing our approach to test writing and how you can use it to get better results from your QA tests.
Crafting well-written test cases is critical to getting reliable, fast results from your manual QA tests. But learning to write better QA tests can take a bit of practice. That’s why when we onboard new customers, each one gets paired with a customer success manager who gives them a crash course in writing kickass tests.
Whether you use Rainforest or not, the QA test writing strategies that we teach our users provide a solid foundation for writing QA tests that get reliable results.

Better Quality Products Start With Better QA Tests

Clearly written, singularly-focused tests make your testing process run a lot more smoothly. You’ll be better able to communicate your expectations to your testers, leading to more deterministic feedback. And your tests can be executed more quickly – and scale up easily – because there’s no confusion about what needs to be done for any given test case.
Trouble with test writing can also be a good indicator of larger problems with your product or interface. Engineer Max Spankie at ConsumerAffairs discovered that the hyper-focused format of Rainforest test cases serves as its own smoke test for quality:
We realized that if it takes too much effort to write a test that testers will understand, that’s a red flag for us about the usability of that feature.
As any developer knows, writing good tests forces you to think through exactly how you want something to work. This can have a real impact on the quality of your product, and keeps your team from wasting time chasing down rabbit-holes when what the product really needs is a holistic overhaul.

What Does a Good QA Test Look Like?

To achieve the end goal of clear, repeatable test results, test cases in the Rainforest platform are written in a simple action-question format.
[Image: the anatomy of a Rainforest QA test]
Rainforest tests allow you to confirm at a glance that key processes work as intended through a simple series of instructions which we call steps. Each step consists of an action and a question about the results of that action. The action must be simple and complete, and the question should have a “yes” or “no” answer. The format is designed to make QA test results as clear and unambiguous as possible.
One of our customer success managers, Lita, gives the following insight on why writing tests in this format produce great results:
It’s best to direct testers through unambiguous steps that can be immediately associated with success or failure by your team. This takes the burden of interpretation off the testers, while providing crystal clear answers about whether your process worked. The great thing about this approach to writing tests is that you completely control the performance of the testers, whether or not they’re familiar with the product or feature being tested. By distilling a test case down to the essential actions required to complete a process, and tracking success and failure along the way, the action-question method of test writing helps pinpoint bugs quickly and precisely.

Finding the Right Test Scope

Another important aspect of writing good QA tests is to make sure that the scope of your tests is aligned with the feedback you care about.
Oftentimes when our customers first start writing tests in Rainforest, they try to cram too much into a single test case, or into a single action-question pair.
This is part of the reason why we require questions to be formatted for a “yes” or “no” answer in Rainforest. By constraining the results of each step in the test to a simple “it worked” or “it didn’t work,” you remove any ambiguity about whether that step passed or failed.
Ideally, the scope of a single test case should represent a complete (but simple) user activity. For example, a common test case in Rainforest is to create a new user profile. Another test case might be to log in to an existing account. While testing these two activities in one long test case might seem like it will save time, ultimately each set of actions represents a different activity, and should be separated into two test cases.

Troubleshooting Bad QA Tests

In most cases, a well-written test should be easy to execute, even for someone who doesn’t have any special knowledge of the product. Rainforest customers have this benefit baked into their manual testing flows, since our network of testers can flag poorly-written tests as confusing. You can get some of this benefit at your organization by having a third-party review your tests for clarity, although this is a much slower, less scalable alternative.

Share Your QA Test Writing Strategy!

Writing better tests is one of the most significant ways that you can improve your QA results without investing more time, energy and resources into your workflow. Do you have a test-writing strategy that works for you? Tweet @rainforestqa and let us know!

MySQL Document Store Developments






by Sarang Nagmote



Category - Databases
More Information & Updates Available at: http://insightanalytics.co.in




This blog post will discuss some recent developments with MySQL document store.
Starting with MySQL 5.7.12, MySQL can be used as a real document store. This is great news! 
In this blog post, I am going to look into the history of making MySQL work better for "NoSQL" workloads, and go into more of the details on what the MySQL document store offers at this point.
First, the idea of using reliable and high-performance MySQL storage engines for storing or accessing non-relational data through SQL is not new. 

Previous Efforts 

MyCached (Memcache protocol support for MySQL) was published back in 2009. In 2010 we got HandlerSocket plugin, providing better performance and a more powerful interface. 2011 introduced both MySQL Cluster (NDB) support for MemcacheD protocol and MemcacheD access to InnoDB tables as part of MySQL 5.6.
Those efforts were good, but they took a rear-window view. They provided a basic (though high-performance) key-value interface, but many developers needed both the flexibility of unstructured data and the richness inherent in structured data (as seen in document store engines like MongoDB). 
When the MySQL team understood the needs, MySQL 5.7 (the next GA after 5.6) shipped with excellent features like JSON document support, allowing you to mix structured and unstructured data in the same applications. This support includes indexes on JSON fields, as well as an easy way to reference fields "inside" the document from applications. 
MariaDB 5.3 attempted to support JSON functionality with dynamic columns. More JSON functions were added in MariaDB 10.1, but both these implementations were not as well done or as integrated as in MySQL 5.7—they have a rushed feel to them. The plan is for MariaDB 10.2 to catch up with MySQL 5.7.  
JSON in SQL databases is still a work in progress, and there is no official standard yet. As of right now different DBMSs implement it differently, and we’ve yet to see how a standard MySQL implementation will look.

MySQL as a Document Store

Just as we thought we would have to wait for MySQL 5.8 for future "NoSQL" improvements, the MySQL team surprised us by releasing MySQL 5.7.12 with a new "X Plugin." This plugin allows us to use MySQL as a document store and avoid using SQL when a different protocol would be a better fit.
Time will tell whether the stability and performance of this very new plugin are any good—but it’s definitely a step in the right direction! 
Unlike Microsoft DocumentDB, the MySQL team chose not to support the MongoDB protocol at this time. Their protocol, however, looks substantially inspired by MongoDB and other document store databases. There are benefits and drawbacks to this approach. On the plus side, going with your own syntax and protocol allows you to support a wealth of built-in MySQL functions or transactions that are not part of the MongoDB protocol. On the other hand, it also means you can’t just point your MongoDB application to MySQL and have it work.  
In reality, protocol level compatibility at this level usually ends up working only for relatively simple applications. Complex applications often end up relying on not-well-documented side effects or specific performance properties, requiring some application changes anyway. 
The great thing about MySQL document store is that it supports transactions from the session start. This is important for users who want to use document-based API, but don’t want to give up the safety of data consistency and ACID transactions.
The new MySQL 5.7 shell provides a convenient command line interface for working with document objects and supports scripting with SQL, JavaScript, and Python.
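To give a feel for it, here is a rough sketch of a session in the new shell’s JavaScript mode, based on the X DevAPI documentation. The collection and field names are made up, and exact calls may differ between versions:

// mysqlsh --uri user@localhost:33060 (the X Plugin's default port)
var db = session.getSchema('test');

var products = db.createCollection('products');

// Insert a JSON document, no table schema required.
products.add({ name: 'bananas', qty: 100, tags: ['fruit'] }).execute();

// Query documents with a bound parameter.
var result = products.find('qty > :minQty').bind('minQty', 10).execute();
print(result.fetchAll());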
The overall upshot of this effort is that developers familiar with MySQL, who also need document store functionality, will be able to continue using MySQL instead of adding MongoDB (or some other document store database) to the mix in their environments. 
Make no mistake, though: this is an early effort in the MySQL ecosystem! MongoDB and other companies have had a head start of years! Their APIs are richer (in places), supported by more products and frameworks, better documented, and more understood by the community in general... and are typically more mature. 
The big question is when will the MySQL team be able to focus their efforts on making document-based APIs a "first-class citizen" in the MySQL ecosystem? As an example, they need to ensure stable drivers exist for a wide variety of languages (currently, the choice is pretty limited). 
It would also be great to see MySQL go further by taking on other areas that drive the adoption of NoSQL systems, such as the easy way they achieve high availability and scale. MySQL’s replication and manual sharding were great in the early 2000s, but they are well behind modern ease-of-use and dynamic scalability requirements.
Want to learn more about this exciting new development in MySQL 5.7? Join us at Percona Live! Jan Kneschke, Alfredo Kojima, Mike Frank will provide an overview of MySQL document store as well as share internal implementation details.

Big Jobs, Little Jobs






by Sarang Nagmote



Category - Data Analysis
More Information & Updates Available at: http://insightanalytics.co.in




You’ve probably heard the well-known Hadoop paradox that even on the biggest clusters, most jobs are small, and the monster jobs that Hadoop is designed for are actually the exception.
This is true, but it’s not the whole story. It isn’t easy to find detailed numbers on how clusters are used in the wild, but I recently came across some decent data on a 2011 production analytics cluster at Microsoft. Technology years are like dog years, but the processing load it describes remains representative of the general state of things today, and back-of-the-envelope analysis of the data presented in the article yields some interesting insights.
A graph from this article shown below illustrates the job size distribution for one month of processing. The total number of jobs for the month was 174,000, which is equivalent to a job starting every four seconds, round the clock—still a respectable workload today.

[Graph 1: cumulative fraction of jobs (Y-axis) vs. job size (X-axis, log scale)]

The authors make an interesting case for building Hadoop clusters with fewer, more powerful, memory-heavy machines rather than a larger number of commodity nodes. In a nutshell, their claim is that in the years since Hadoop was designed, machines have gotten so powerful, and memory so cheap, that the majority of big data jobs can now run within a single high-end server. They ask if it would not make more sense to spend the hardware budget on servers that can run the numerous smaller jobs locally, i.e., without the overhead of distribution, and reserve distributed processing for the relatively rare jumbo jobs.
Variations of this argument are made frequently, but I’m not writing so much to debunk that idea, as to talk about how misleading the premise that “most Hadoop jobs are small” can be.
Several things are immediately obvious from the graph.
  • The X-axis gives job size, and the Y-axis gives the cumulative fraction of jobs that are smaller than x.
  • The curve is above zero at 100KB and does not quite hit 1.0 until is at or near 1PB, so the job size range is roughly 100KB to 1PB, with at least one job at or near each extreme.
  • The median job size, i.e., the size for which one-half of the jobs are bigger and one-half are smaller (y=0.5), corresponds to a dataset of only 10GB, which is peanuts nowadays, even without Hadoop.
  • Almost 80% of jobs are smaller than 1TB.
One glance at this graph and you think, “Wow—maybe we are engineering for the wrong case,” but before getting too convinced, note that the X-axis is represented in log scale. It almost has to be, because were it not, everything up to at least the 1TB mark would have to be squeezed into a single pixel width! Or to put it the other way around, if you scaled the X-axis to the 1GB mark, you’d need a monitor about fifteen miles wide. It is natural to conflate “job count” and “amount of work done” and this is where the log scale can trick you because the largest job processes about 10,000,000,000 times as much data as the smallest job and 100,000 times as much data as the median job.
When you process petabytes, a single terabyte doesn’t sound like much, but even in 2016, 1TB is a lot of data for a single server. Simply reading 1TB in parallel from 20 local data disks at 50MB/sec/disk would take almost 17 minutes even if other processing were negligible. Most 1TB jobs would actually take several times longer because of the time required for such other tasks as decompression, un-marshalling and processing the data, writing to scratch disks for the shuffle-sort, sorting and moving the mapped data from scratch disks to the reducers, reading and processing the reducer input, and ultimately writing the output to HDFS (don’t forget replication.) For many common job types such as ETL and sorting, each of these tasks deals with every row, multiplying the total disk and IPC I/O volume to several times the amount of data read.
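The figures above are easy to sanity-check with a little arithmetic; a quick JavaScript back-of-the-envelope, for instance:

// Reading 1TB in parallel from 20 local disks at 50MB/sec/disk
var seconds = 1e12 / (20 * 50e6);
console.log(seconds / 60);   // ~16.7 minutes, before any other processing

// Largest job (~1PB) vs. smallest (~100KB) and median (~10GB) from the graph
console.log(1e15 / 100e3);   // 10,000,000,000x more data than the smallest job
console.log(1e15 / 10e9);    // 100,000x more data than the median job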

An Alternate View

If it takes 17 server-minutes simply to read a terabyte of data for a map-only job with negligible processing and output, it’s a stretch to say that it’s reasonable to run 1TB jobs without distribution, but even so, let’s arbitrarily fix 1TB as our upper bound for processing on a single server.
I sketched an inverted view of the same numbers on the original graph. The line is derived by multiplying the change in the number of jobs by the job size. You can’t tell exactly how many of the biggest jobs there are, so I assumed there is only one, which is the weakest assumption.
The data read for 10GB and smaller jobs turns out to be negligible and the cumulative data processed of up to 1TB (the smallest 80% of jobs) is only about 0.01 of the total, which barely clears the y=0 line.
[Graph 2: cumulative fraction of data processed vs. job size, derived from the same numbers]
Jobs of up to 10TB in size, which comprise 92% of all jobs, still account for only about 0.08 of the data.
The other 92% of the cluster resources are used by the jobs in the 10TB to 1PB range—big data by any definition.
Consistent with this, “median” now means the job size for which half of the containers are for smaller jobs and half are for larger jobs. Defined this way, the median job size is somewhere in the neighborhood of 100TB, which is four orders of magnitude larger than the median for job count.
Far from supporting the intuition that most of the work is in the form of small jobs, the inverted graph is telling you that at least 99% of the resources are consumed by big jobs, and of that 99%, most is for huge jobs.

Responsiveness and Throughput

Does this mean that there’s nothing useful in knowing the distribution of job sizes, as opposed to resource consumption? Of course not—you just have to be careful what lessons you draw. Cumulative job count and cumulative data processed give very different pictures, but both are important.
The most important thing that the original version of the graph tells us is the critical importance of responsiveness, i.e., how quickly and predictably jobs execute. Big jobs may account for 99% of processing but the user experience is all in the small jobs—responsiveness is huge. Humans don’t really care at a gut-level whether an 8-hour, 1PB job takes an hour or two more or less, but they care a great deal if a Hive query moves from the finger-drumming time scale to the single-second range, which might be a latency difference of only a few seconds. They are particularly annoyed if the same query takes three seconds on one run and 30 seconds on another.
What the inverted version of the graph tells you is that the real work is all in the big jobs, where throughput—sheer gigabytes per second—is what matters.
So, what does that tell you about big expensive machines? Here’s how it breaks down.
The original motivation for running jobs on a single machine that was cited in the article was increasing efficiency by avoiding the overhead of distribution. However, while a high-end machine can get excellent throughput per-core, the real goal is responsiveness, which is inherently difficult to achieve using a single machine because it requires serializing work that could be done in parallel. Say your killer machine has 32 hardware cores and tons of memory. Even a puny 10GB job still requires 80 map tasks, which means that at least some cores will have to run at least three mappers serially even if the cluster is otherwise idle. Even for tiny jobs, the killer machine would have to be three times as powerful to achieve lower latency, even on an idle machine. Moreover, responsiveness is also about minimizing the longest latency. Unless the cluster is idle, the variance in tasks/core will be very high on a single machine. With distributed processing, the work can be dynamically assigned to the least-busy nodes, tightening the distribution of tasks-per-core. By the same token, a 1TB job would require at least 250 mappers running serially on each core. If a job is distributed over 100 16-core machines, each core would only have to run five mappers, and workers can be chosen to equalize the load.
For the big jobs, where total throughput, not responsiveness, is the issue, high-end machines are a poor bargain. High-end hardware provides more CPU and disk but you get less of each per dollar.
When thinking about Hadoop, you’ll rarely go wrong if you keep in mind that Hadoop jobs are usually I/O bound for non-trivial tasks. Even if, strictly speaking, a particular MR or Tez job is CPU bound, in most cases, a large part of that CPU activity will still be directly in service of I/O: compression and decompression, marshaling and un-marshaling of bulk data, SerDe execution, etc. While high-end machines are best in terms of throughput per core and per disk, those virtues count more for conventional processing.

Conclusion

Hadoop is all about parallelism, not only because it’s critical for big jobs, but because it’s the key to responsiveness for smaller jobs. The virtues of high-end machines don’t count for much in the land of the elephants.

Thursday, 28 April 2016

WhitestormJS: Web Game Development Made Easy






by Sarang Nagmote



Category - Website Development
More Information & Updates Available at: http://insightanalytics.co.in





We just released a new version, and it is a big one.
Have you ever tried to develop a 3D web-based game? I’m pretty sure you have, but you also found that it wasn’t easy at all. You have to handle so many things to get your project working: shapes, physics, textures, materials, lighting, camera position, controls... and, of course, we want all of that done in a fast and efficient way. That’s just what WhitestormJS is for.
WhitestormJS is a JavaScript game engine. It wraps physics, lighting, surfaces, and textures in a simple yet powerful API.

3D in our browsers

Since we all know how big the web is, there are several underused technologies, such as WebGL. But the idea of creating amazing 3D experiences is something that has been around for a while, and now it’s faster and easier than ever.

The cool part

Let’s take a look at the problem:
  1. We want an object to be rendered with a proper, smooth shape, but that will take some of our resources.
  2. We want to calculate physics in a proper way, but with a complex model that’s hard.
WhitestormJS uses a JSON-like structure for creating objects from the input data and adding them to the 3D world. WhitestormJS solves this problem by allowing you to set two different models for a single object: one, the complex model, to render the object itself, and a second one for the physics calculations.
[Image: WhitestormJS uses different models to calculate a single object’s physics and shape]

The cooler part

WhitestormJS uses the Web Workers API, which allocates the rendering and physics calculations to different threads and allows the browser to refresh animation frames at light speed.

The even cooler part

It has some built-in shape and light classes that will help you kick-start your game development workflow.

What about controls?

You don’t really need to worry about controls: WhitestormJS has first-person controls and orbit controls built in, and it will take you just one line to set them up (well, probably two).

Extending WhitestormJS for 3D apps

If you find that you need something from WhitestormJS that is not built in, WhitestormJS has a plugin system, which is easy and fast.
The idea is simple: you have two basic super classes called WHS.Shape and WHS.Light. Both of them have similar methods and attributes. All components in WhitestormJS are built with the help of these two classes. If you want to use them, you should write your own class that extends one of them. You will automatically get all their functions for building and working with a WHS object from the input parameters.
Later you can change its location attributes with the setPosition() and setRotation() methods.

Going behind the storm

WhitestormJS is built with the fantastic ECMAScript 6, and the GitHub organization provides its own custom versions of ammo.js and PhysiJS.
WhitestormJS just released its r8 version, and it’s now available as an npm package. And there are big plans for further development.

From the idea to the code

Everything in WhitestormJS is just a bridge for Three.js, simplified for crafting, but all functionality is retained.

How?

After you call WHS.Shape, it returns a WHS object that contains a mesh object, which is analogous to your object in Three.js, and you can choose which one to work with.

Setting up workspace

WHS.World creates a Three.js scene wrapped by our custom Physi.js. The analog in Three.js is THREE.Scene.

var world = new WHS.World({
  stats: "fps", // fps, ms, mb, or false if not needed.
  gravity: { // Physics gravity.
    x: 0,
    y: -100,
    z: 0
  },
  path_worker: "physijs_worker.js", // Path to the Physijs worker here.
  path_ammo: "ammo.js" // Path to Ammo.js from the Physijs worker.
});

// Define your scene objects here.

world.start(); // Start animations and physics simulation.

It will set up a new world with normal gravity and start animating it.

WHS.Box == ( WHS.Shape )

The first basic WHS object that we will add to our world will be a Box. As you can see, in WhitestormJS a shape is made from the configuration object we pass.
We can apply options for the Three.js material object (THREE.Material) in the material parameter. You can pass any argument of the proper material configuration; remember that it is still a Three.js material object. The only option in material that is not for THREE.Material is kind. This option is the type of material you will use. For instance, kind: "basic" will be THREE.MeshBasicMaterial.

var box = world.Box({
  geometry: { // THREE.BoxGeometry
    width: 20,
    height: 20,
    depth: 20
  },
  mass: 10, // Physijs mass.
  material: {
    color: 0xffffff,
    kind: "basic" // THREE.MeshBasicMaterial
  },
  pos: { // mesh.position
    x: 10,
    y: 50,
    z: 0
  },
  rot: { // mesh.rotation
    x: Math.PI/2,
    y: Math.PI/4,
    z: 0
  }
});

box.setPosition( 0, 100, 0 ); // will be set as x: 0, y: 100, z: 0
box.setRotation( 0, Math.PI/6, 0 );

box.clone().addTo( world );
What’s amazing about the code above is that with just one line you can apply a rotation (with the setRotation function) or set the object’s current position (with setPosition), simply by passing the transformation you need as arguments.
You can also clone that simple box and add the clone to the world with a single line.

Wrapping all up

You can see all of the features on the WhitestormJS website.
Contribute with new features and get involved with development on github.
And follow WhitestormJS creator Alexander Buzin on Twitter.
You can also play with our first person example now.

Technical Debt Shouldnt Be Handled Like Financial Debt






by Sarang Nagmote



Category - Developer
More Information & Updates Available at: http://insightanalytics.co.in




Like many software developers in the 21st century, I use the term “technical debt” in a negative way: it’s the ever-accumulating cruft in your system that stands in the way of adding new features. As technical debt increases, the work takes ever longer, until you reach a point where forward progress ceases.
This view of technical debt equates it to a credit card: unless you pay your balance in full each month, you’re charged interest. If you only make the minimum payment, that interest accrues and it will take you years to pay off the card. If you make the minimum payment and keep charging more, you may never get out of debt. Eventually, after maxing out several cards, you’ll have to declare bankruptcy.
But that’s a very puritanical view of debt, and it’s not a view shared by everyone.
For a person with a business-school background, debt is a tool: if you can float a bond at 5% to build a factory that gives you a 10% boost in income, then you should do that (usually; there are other factors to consider, such as maintenance and depreciation). More important, you’re not going to pay that bond off before it’s due; doing so would negate the reasons that you issued it in the first place.
Which means that the term “technical debt” probably doesn’t have the same connotations for your business users as it does for you. In fact, using that term may be dangerous to the long-term prospects of your project. If you say “we can release early but we’ll add a lot of technical debt to do so,” that’s a no-brainer decision: of course you’ll take on the debt.
I think a better term is total cost of ownership (TCO): the amount you pay to implement features now, plus the amount you will pay to add new features in the future. For example, “we can release this version early, but we’ll add three months to the schedule for the next version.”
Which may still mean that you cut corners to release early, and probably won’t stave off demands to release the next version early as well. But at least you’ll be speaking the same language.

Sorted Pagination in Cassandra






by Sarang Nagmote



Category - Databases
More Information & Updates Available at: http://insightanalytics.co.in




Cassandra is a fantastic database for different use cases. There are different situations when you need to twist Cassandra a little, and studying one of those situations can be a helpful exercise to better understand what Cassandra is about. Databases are complex beasts; approaching them with the right level of abstraction is vital. Their final goal is not storing data per se, but making that data accessible. Those read patterns will define which database is the best tool for the job.

Time Series in Cassandra

A time series is a collection of data related to some variable. Facebook’s timeline would be a great example. A user will write a series of posts over time. The access pattern for that data will be something like “return the last 20 posts of user 1234”. The DDL of a table that models that query would be:
CREATE TABLE timeline (
  user_id uuid,
  post_id timeuuid,
  content text,
  PRIMARY KEY (user_id, post_id)
) WITH CLUSTERING ORDER BY (post_id DESC);
In Cassandra, Primary Keys are formed by Partition Keys and Clustering Keys. Primary keys enforce the uniqueness of some cells in a different way than relational databases. There is no strong enforcement of that uniqueness: if you try to insert some cell related to an already existing primary key, it will be updated. It also works the other way around: an update to a missing row will end up as an insert. That’s called an upsert.
Partition keys determine in which node of the cluster the data is going to live. If you include at least one clustering key, the partition key will identify N rows. That could be confusing for someone coming from traditional relational databases. Cassandra does its best trying to bring its concepts into SQL terminology, but sometimes it can be weird for newbies. An example of the timeline table would be:
user_id                              | post_id       | content
346e896a-c6b4-4d4e-826d-a5a9eda50636 | today         | Hi
346e896a-c6b4-4d4e-826d-a5a9eda50636 | yesterday     | Hola
346e896a-c6b4-4d4e-826d-a5a9eda50636 | one week ago  | Bye
346e896a-c6b4-4d4e-826d-a5a9eda50636 | two weeks ago | Ciao
In order to understand the example, I converted the post_id values into something that makes sense for the reader. As you can see, there are several values with the same partition key (user_id), and that works because we defined a clustering key (post_id) that clusters those values and sorts them (descending in this case). Remember that uniqueness is defined by the primary key (partition plus clustering key), so if we insert a row identified by 346e896a-c6b4-4d4e-826d-a5a9eda50636 and today, the content will be updated. Nothing really gets updated on disk, as Cassandra works with immutable structures on disk, but at read time different writes with the same primary key will be reconciled, and the most recent one wins.
Let’s see some queries to finish this example:
SELECT * FROM timeline
WHERE user_id = 346e896a-c6b4-4d4e-826d-a5a9eda50636;

-> It will return four rows, sorted by post_id DESC.

SELECT content FROM timeline
WHERE user_id = 346e896a-c6b4-4d4e-826d-a5a9eda50636 LIMIT 1;

-> It will return Hi.

SELECT content FROM timeline
WHERE user_id = 346e896a-c6b4-4d4e-826d-a5a9eda50636 AND post_id < today LIMIT 2;

-> It will return Hola and Bye.
As you can see, implementing sorted pagination is extremely easy when modeling time series in Cassandra. Besides, it will be super performant, as Cassandra stores all the rows identified by a single partition key in the same node, so a single round trip is needed to fetch this data (assuming read consistency level ONE).
Let’s see what happens when we want to implement sorted pagination in a different use case.

Sorted Sets in Cassandra

If we think about the previous example at the data-structure abstraction level, we can see that we just modeled a Map whose values are Sorted Sets. What happens if we want to model something like a Sorted Set with Cassandra?
Our scenario is the following: the users of our system can be suspended or unsuspended through some admin portal. The admins would like to have a look at the last users that have been suspended, along with the suspension reason, in order to verify that decision or revoke it. That’s pretty similar to our previous paginated queries, so let’s see how we can model that with Cassandra.
CREATE TABLE suspended_users (
  user_id uuid,
  occurred_on timestamp,
  reason text
)
I’ve deliberately left the Primary Key out of this DDL so we can discuss different options.

Understanding Clustering Keys

Previously we used clustering keys to provide some order for our data. Let’s go with that option:
PRIMARY KEY (user_id, occurred_on)
Can you see what is wrong with this? Forget about implementation details for a second and answer this question: how many times will a user appear in this table? As your self-elected product owner, I’ll say only once. Once a user is unsuspended, I’d like to remove the user from that table, and a user that is already suspended can’t be suspended again. Next question: where do we want to keep some order? Not inside users (even less in this case, as our single user row will always be ordered), but amongst users. This design won’t work.

Understanding Partition Keys and Partitioners

I have a new bit of information that might help you. This table will be updated in real time, which means that it should keep some kind of logical insertion order. As we didn’t get into the details of Cassandra, we could think that the following will work:
PRIMARY KEY (user_id)
Let’s see how that logical insertion order maps into the physical one. Cassandra stores its data in a ring of nodes. Each node gets assigned one token (or several if we use vnodes). When you CRUD some data, Cassandra will calculate where in the ring that data lives using a Partitioner that hashes the Partition Key. When using the recommended partitioners, Cassandra rows are ordered by the hash of their partition key, and hence the order of rows is not meaningful; that logical insertion order will be logical and nothing else. That means that this query will return 20 users without any meaningful order:
SELECT * FROM suspended_users LIMIT 20;
Using the token function, we could paginate large sets of data, as has been explained elsewhere.
SELECT * FROM suspended_users where token(user_id) > token([Last user_id received]) LIMIT 20;
However, we want to paginate a sorted set by suspension time and descending.

Presenting Reverse Lookups

Denormalisation is something usual in Cassandra. In order to overcome restrictions imposed by Cassandra’s implementation, denormalising our data is a suggested approach. Thanks to our previous example, we understood that to keep some order between data we need to cluster it. Nobody forces us to use a suspended_users table, even if our domain talks about it. As we need some fixed variable to create a time series, we’ll go with the status:
CREATE TABLE users_by_status (
  status text,
  occurred_on timestamp,
  user_id uuid,
  reason text,
  PRIMARY KEY (status, occurred_on, user_id)
) WITH CLUSTERING ORDER BY (occurred_on DESC);
Partition and clustering keys can be compound. In this particular key, status will be the partition key and occurred_on/user_id the clustering key. The default order is ASC, so that’s why we specified occurred_on DESC inside of CLUSTERING ORDER BY. It’s important to note that user_id serves uniqueness purposes in this design, even if it will also order rows in the unlikely case of two users being suspended at the very same time.
Now that we have created an artificial clustering, we can paginate in a sorted way like in our first example. This presents several problems, though. Cassandra won’t split the data inside of a partition across nodes, and the recommended maximum number of rows inside of a partition is 200k. If you foresee that your system will grow more than that, you can split the rows with the technique of compound partition keys using temporal buckets.
CREATE TABLE users_by_status (
  bucket text,
  status text,
  occurred_on timestamp,
  user_id uuid,
  reason text,
  PRIMARY KEY ((bucket, status), occurred_on, user_id)
) WITH CLUSTERING ORDER BY (occurred_on DESC);
The bucket would be something like MM-YYYY, or whatever fine-grained precision your data suggests. Here I present a new bit of CQL (Cassandra Query Language): compound partition keys. As you can see, whatever is inside of those nested parentheses will be the partition key.
The next issue is how we will delete or update users that need to be unsuspended. The admin could have the user_id and occurred_on, and that wouldn’t be a problem, as he could run a query like this:
DELETE FROM users_by_status
WHERE status = 'SUSPENDED' AND occurred_on = ... AND user_id = ...
Unfortunately, that admin could get a request from some privileged manager to unsuspend a user. The manager doesn’t know when the suspension happened; they only know who the user is. That means that we can’t access the concrete row, as we don’t have occurred_on. Remember that to query in Cassandra you need to provide the whole partition key (otherwise Cassandra won’t know which node it has to go to in order to fetch the data) and, optionally, parts of the clustering key (but always from left to right).
In order to overcome this issue, we could create a secondary index on the user_id column. In relational databases, indexes allow us to query some data faster by creating a denormalized structure. In Cassandra, secondary indexes allow us to query by columns that would otherwise be impossible to use. However, they’re discouraged, as they’re a big hit on performance: they require several round trips to different nodes.
The next solution is creating our own secondary index manually, in something called a reverse lookup. Let’s see how it looks:
CREATE TABLE suspended_users (
  user_id uuid,
  occurred_on timestamp,
  PRIMARY KEY (user_id)
);
This table will serve as our reverse lookup. Just by having the user_id, we’ll be able to access the occurred_on value, and then we’ll be able to query the users_by_status table. This approach has some drawbacks. Whenever we insert or delete a user, we’ll have to go to two tables, but that’s a fixed number. With a secondary index, we would have to go to N nodes in the worst case, so it goes from O(1) to O(N). Our code will also be more complicated, as we’ll have to deal with two different tables.
That brings up a more serious drawback: eventual consistency and transactions in Cassandra. Transactions are not built into the core of Cassandra (there are concepts like lightweight transactions or batches, but those are inefficient too), so our code needs to take care of transactions manually.
If we want to delete a user, we should start with the users_by_status table. If we start the other way around and the second deletion fails, we’ll be unable to delete that row in the future, as we’ve already deleted the reverse-lookup entry. We can introduce the Saga pattern, which basically defines a compensating rollback step for every single step of a programmatic transaction.
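To make the two-table dance concrete, here is a rough sketch using the DataStax Node.js driver (the cassandra-driver npm package). It assumes the non-bucketed users_by_status table and the suspended_users reverse lookup defined above; the keyspace, contact point, and data center names are placeholders.

const cassandra = require('cassandra-driver');
const client = new cassandra.Client({
  contactPoints: ['127.0.0.1'],
  localDataCenter: 'datacenter1',
  keyspace: 'mykeyspace'
});

async function unsuspendUser(userId) {
  // 1. Reverse lookup: when was this user suspended?
  const lookup = await client.execute(
    'SELECT occurred_on FROM suspended_users WHERE user_id = ?',
    [userId], { prepare: true });
  if (lookup.rowLength === 0) return; // not suspended

  const occurredOn = lookup.rows[0].occurred_on;

  // 2. Delete from users_by_status first, as discussed above...
  await client.execute(
    'DELETE FROM users_by_status WHERE status = ? AND occurred_on = ? AND user_id = ?',
    ['SUSPENDED', occurredOn, userId], { prepare: true });

  // 3. ...and only then remove the reverse-lookup entry.
  await client.execute(
    'DELETE FROM suspended_users WHERE user_id = ?',
    [userId], { prepare: true });
}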

Conclusion

As you could see, something as straightforward in a relational database as querying a sorted set of data with pagination can be tricky in Cassandra as soon as we introduce some requirements. If your infrastructure allows it, you should use a polyglot persistence approach that uses the best tool for every use case. Anyway, Cassandra gives you enough flexibility to model data even when it’s not its best use case.

Working With AVRO and Parquet Files






by Sarang Nagmote



Category - Data Analysis
More Information & Updates Available at: http://insightanalytics.co.in




With significant research and help from Srinivasarao Daruna, Data Engineer at airisdata.com
See the GitHub Repo for source code.
Step 0. Prerequisites:
For details on installation, see here: http://airisdata.com/scala-spark-resources-setup-learning/
Step 1: Clone this Repository into a directory (like c:/tools or /tools)
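Assuming git is installed and on your path, that would look something like this (the target directory is just an example):
cd /tools
git clone https://github.com/airisdata/avroparquet.git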
Step 1 - Alternate: You can download the Zip file from https://github.com/airisdata/avroparquet and unzip it. The extracted folder will be named avroparquet-master.
Step 2: Clone Parquet Map Reduce Tools (for Parquet Command Line Tools) Note: For this step you must have JDK 1.8 installed and in your path. Also you must have Maven 3.x installed and in your path.
git clone -b apache-parquet-1.8.0 https://github.com/apache/parquet-mr.git
cd parquet-mr
cd parquet-tools
mvn clean package -Plocal
Step 3: Copy the /target/parquet-tools-1.8.0.jar to a directory in your path
Step 4: Copy meetup_parquet.parquet from the avroparquet.git repository to a directory accessible from the parquet tools, or to the same directory.
Step 5: View the binary Parquet file (meetup_parquet.parquet) using the parquet tools. This format works on Mac; you may need to set PATHs and change the directory structure on Windows or Linux.
java -jar ./parquet-tools-1.8.0.jar cat meetup_parquet.parquet
Step 6: View the Schema for the Same Parquet File
java -jar ./parquet-tools-1.8.0.jar schema meetup_parquet.parquet
Step 7: Download the AVRO command line tools.
You can either download with curl, wget, or directly from a browser using the link below:
wget http://apache.claz.org/avro/avro-1.8.0/java/avro-tools-1.8.0.jar
Step 8: Copy the avro-tools jar to your path or to your local directory.
Step 9: Copy an AVRO file to your local directory or an accessible directory from AVRO tools
Download from here: https://github.com/airisdata/avroparquet/blob/master/airisdata-meeetup/src/main/resources/avro_file.avro
wget https://github.com/airisdata/avroparquet/blob/master/airisdata-meeetup/src/main/resources/avro_file.avro
Step 10: Dump the AVRO file as pretty-printed JSON using the AVRO tools:
java -jar avro-tools-1.8.0.jar tojson --pretty avro_file.avro
For more information see:
Step 11: Avro and Parquet Java Instructions. Go to the directory where you downloaded https://github.com/airisdata/avroparquet/tree/master/airisdata-meeetup.
If you download the ZIP from GitHub it will be in directory avroparquet-master/airisdata-meetup
cd avroparquet-master
cd airisdata-meetup
Step 12: Use Maven to build the package
mvn clean package
Step 13: AVRO File Processing
java -cp ./target/avro-work-1.0-SNAPSHOT-jar-with-dependencies.jar com.airisdata.utils.StorageFormatUtils avro write src/main/resources/avro_file.avro src/main/resources/old_schema.avsc
java -cp ./target/avro-work-1.0-SNAPSHOT-jar-with-dependencies.jar com.airisdata.utils.StorageFormatUtils avro read src/main/resources/avro_file.avro
java -cp ./target/avro-work-1.0-SNAPSHOT-jar-with-dependencies.jar com.airisdata.utils.StorageFormatUtils avro read src/main/resources/avro_file.avro src/main/resources/new_schema.avsc
cat src/main/resources/new_schema.avsc
Step 14: PARQUET File Processing
java -cp ./target/avro-work-1.0-SNAPSHOT-jar-with-dependencies.jar com.airisdata.utils.StorageFormatUtils parquet write src/main/resources/parquet_file.parquet src/main/resources/old_schema.avsc
java -cp ./target/avro-work-1.0-SNAPSHOT-jar-with-dependencies.jar com.airisdata.utils.StorageFormatUtils parquet read src/main/resources/parquet_file.parquet
Step 15: Kafka Setup. Download Kafka. (or for Mac you can do brew install kafka)
curl -O https://www.apache.org/dyn/closer.cgi?path=/kafka/0.9.0.1/kafka_2.10-0.9.0.1.tgz
Step 16: Unzip/tar Kafka: tar -xvf ./kafka_2.10-0.9.0.1.tgz
Step 17: Run Zookeeper bin/zookeeper-server-start.sh config/zookeeper.properties
Step 18: Run Kafka bin/kafka-server-start.sh config/server.properties
Step 19: Go back to the directory where you downloaded https://github.com/airisdata/avroparquet/tree/master/
Step 20: You must have Scala and SBT installed and in your path. You need Scala 2.10, JDK 8, and SBT 0.13. You can install these via brew.
Step 21: Build the Scala/Spark Program. You must have Spark 1.6.0+ installed
cd storageformats_meetup
sbt clean assembly
Step 22: Submit this jar to Spark to run it. You will need Spark installed and accessible in your path (brew install spark, or see previous meetups). Submit the Kafka Avro producer; spark-submit must be available from wherever you run this.
spark-submit --class com.airisdata.streamingutils.ClickEmitter target/scala-2.10/storageformats_meetup-assembly-1.0.jar localhost:9092 test
Step 23: Submit Avro Consumer
spark-submit --class com.airisdata.streamingutils.KafkaAvroConsumer target/scala-2.10/storageformats_meetup-assembly-1.0.jar test 2
Step 24: View the Spark History Server (if you are running it).

Monday, 25 April 2016

Angular 2 Coming to Java and Python: The First Multi-language Full Stack Platform?






by Sarang Nagmote



Category - Website Development
More Information & Updates Available at: http://insightanalytics.co.in




Angular 2 is getting near the final release, and the whole community is really excited about the possibilities that it will bring. But the latest announcement a couple of days ago about the likely final release in May included one important quote about the future of Angular:
With Angular 2, we’re really attacking it from a platform of capabilities standpoint... Our plan is to have versions that will work with many server-side technologies, from Java to Python.
Let's go through what this quote might mean in terms of using Angular for full stack development in multiple ecosystems, by going over the following topics:
  • Angular in non-JavaScript languages
  • Full Stack Angular in JavaScript - Angular Universal
  • Advantages of using Angular also on the server side
  • The Angular Universal Starter
  • Server side Angular - a nice to have, or a whole new way of thinking?
  • Conclusions

Angular in Non-JavaScript Languages

This mention of versions of Angular 2 in languages like Java or Python is not something completely new, as Angular 1 itself had a Java version for GWT (angulargwt), which would compile down to JavaScript.
Angular 2 itself is available in Dart, and it's internally built in TypeScript. But the possibility of making it available in languages and platforms other than Dart is something else entirely.
The possibility of using Angular on the server is something that is currently being worked on, via the Angular Universal project.

Full Stack Angular in JavaScript — Angular Universal

Angular Universal is a core Angular project for enabling the use of Angular 2 on a Node.js server for the purposes of server side rendering.
See this latest talk for more details on Angular Universal, and especially this episode of the Read The Source podcast, where we can see Angular Universal in action (including the router part).
Notice that in the first part of the AngularConnect talk we can see an example of how to use the Angular 2 dependency injection on the server. It's worth mentioning that this could also be done without Angular Universal, as a way to structure a server app into decoupled modules.
This is an example of the benefits of using the same technology across the whole stack: the same dependency injection container could be used on the client and on the server.

Overview of How Angular Universal Works

What Angular Universal provides is a view rendering engine for Express, which works just like any other Express rendering engine, such as the Jade or Mustache template engines.
To see this in action, this is a simplified version of what an express server rendering Angular 2 components looks like:
import {ng2engine} from 'angular2-universal-preview';

let app = express();

// config view engine
app.engine('.html', ng2engine);
app.set('views', __dirname);
app.set('view engine', 'html');

// config the root route
app.use('/', function(req, res) {
  let url = req.originalUrl || '/';
  res.render('index', { App, providers: [...], preboot: true });
});
What's going on here is the following:
  • we configure express to use as view rendering engine the Angular Universal engine
  • the engine is configured to render HTML files
  • the root route / is configured to return as a response the rendering of index.html with root component App
There is another important element going on, the Angular router is also active on the server side.

How Does Routing Work in Angular Universal?

The root component App contains router-specific annotations:
@RouteConfig([
  { path: '/', component: Home, name: 'Home' },
  { path: '/home', component: Home, name: 'Home' },
  { path: '/about', component: About, name: 'About' }
])
export class App {
  ...
}
The meaning of these annotations is clear on the client side: navigation will update the HTML5 Browser history instead of triggering a full page refresh, creating the enhanced experience that single page apps are all about.
But on the server side, what does the router do? On the server, the router is configured differently than on the client:
import {ng2engine, NODE_LOCATION_PROVIDERS} from 'angular2-universal-preview';

res.render('index', {
  App,
  providers: [
    ROUTER_PROVIDERS,
    NODE_LOCATION_PROVIDERS,
  ],
  preboot: true
});
Notice that a server side location provider is passed to the application. The way that the Angular 2 router works on the server is that if the user gets sent a direct link to some route inside the application, for example http://yourdomain.com/someroute, the server side version of the router will then take the route path /someroute and use it to determine which component should be rendered.
The result is that the client gets served a fully rendered HTML page, but then the client side router takes over the application. The user will still have the single page experience with no further server side re-rendering being triggered.

Advantages of Using Angular Also on the Server Side

Server side rendering has become popular, for example, in the React community, as it allows product organizations to build single page applications that do not suffer from search engine indexability issues and that give the user a much better experience.
This approach brings the best of both worlds:
  • the user gets served an HTML page that is immediately rendered, with very little startup time
  • because most of the work was done on the server side, only a minimal amount of JavaScript needs to be transferred to the client to take over the page as a SPA, which further speeds things up
  • this is ideal for mobile devices, where we want to avoid serving a large amount of JavaScript over a constrained network
  • the page is easily indexable by any search engine
Let's now see how all this works together in practice, and then how it links back to the announcement of Java and Python versions.

The Angular Universal Starter

The Angular Universal project provides a separate repository for quickly starting a new project: the universal-starter. Let's install and start it:
npm install
npm start
Now let's load a page and inspect the HTML that came over the wire:
(Image: Angular Universal in action)
We can see that the server parsed the component tree and sent back the HTML with only a relatively small script at the bottom of the page containing the registrations of the application browser event handlers.
We can also see that the HTML that came was produced in a node server, but that could have been produced by any other server technology: Java, Python, etc.
It would suffice for that particular language to have a version of Angular Universal that works exactly the same way as the Node version does. This would allow Angular to be used instead of traditional server-side templating engines.

Conclusions

We can see that using Angular on both the client and the server brings several important advantages besides uniformity and the fact that there are fewer technologies to learn.
It's not only about solving SEO issues for single page apps: using Angular on the server enables an enhanced user experience and lets the same app work across a wider range of devices.
Although nothing is concrete yet and we don't have many details, it's probable that Angular will be ported to other languages together with its server-side counterpart, Angular Universal.
Application code written using those ports will then be compiled to JavaScript using technologies like the GWT compiler or PythonJS.
It would be possible to port only the framework, but the benefits of a full stack approach and server side rendering are really important.
By the looks of it, it's quite possible that Angular could in time become the first multi-language full stack development platform.