10 Cool Big Data Projects
- 1
Hadoop
Hadoop is really the flagship technology for open source big data. It grew out of a side project at Yahoo when developers needed a way to store and process the massive amount of data they collected with their new search engine. The technology was eventually contributed to the Apache Software Foundation. Today there are three major distributions from commercial companies – Cloudera, Hortonworks, and MapR. One of Hadoop’s creators, Doug Cutting, recently spoke with InformationWeek about the growth of his baby. We also recently put together a look at Hadoop’s history.
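The MapReduce model that Hadoop popularized can be illustrated with a toy, single-process word count in plain Python. This is a conceptual sketch of the map, shuffle, and reduce phases, not the Hadoop API itself:

```python
from collections import defaultdict
from itertools import chain

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document.
    return chain.from_iterable(
        ((word, 1) for word in doc.split()) for doc in documents
    )

def shuffle(pairs):
    # Shuffle: group values by key, as Hadoop does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the per-word counts emitted by the mappers.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big ideas", "open source big data"]
counts = reduce_phase(shuffle(map_phase(docs)))  # {'big': 3, 'data': 2, ...}
```

In real Hadoop, each phase runs distributed across many machines and the shuffle moves data over the network; the logical shape of the computation is the same.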
- 2
Hive
Apache Hive was initially developed by Facebook and contributed to the Apache Software Foundation. The technology is a data warehouse infrastructure built on top of Hadoop to provide data summarization, query, and analysis.
Companies using Hive include CNET and eHarmony.
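Hive's query language, HiveQL, reads like standard SQL. As an illustration of the kind of summarization query Hive compiles into jobs over Hadoop data, here is an equivalent query run against SQLite, which stands in for a Hive table purely for demonstration:

```python
import sqlite3

# SQLite stands in for a Hive table here; in Hive this same GROUP BY
# query would run over files stored in HDFS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (page TEXT, views INTEGER)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("home", 120), ("search", 80), ("home", 40)],
)
rows = conn.execute(
    "SELECT page, SUM(views) FROM page_views GROUP BY page ORDER BY page"
).fetchall()  # [('home', 160), ('search', 80)]
```

The table name and data are hypothetical; the point is that analysts can express summarization in SQL rather than hand-writing MapReduce code.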
- 3
HBase
Apache HBase grew out of a project at a company called Powerset, which was acquired by Microsoft in 2008. The goal was to process massive amounts of data for natural language search. The technology is an open source, non-relational, distributed database that is modeled after Google’s BigTable and written in Java. HBase became an Apache Software Foundation project in 2010.
Companies using HBase today include Adobe, Facebook, Meetup, and Trend Micro.
(Image: The Apache Software Foundation)
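HBase's BigTable-style data model can be sketched in a few lines of plain Python. This is a toy illustration of the model, not the HBase client API: rows are addressed by a row key, and each cell lives under a column family and qualifier, forming a sparse, schema-less map:

```python
# Toy sketch of HBase's data model: row_key -> {(family, qualifier): value}.
# Real HBase distributes these rows across region servers and versions cells.
table = {}

def put(row_key, family, qualifier, value):
    table.setdefault(row_key, {})[(family, qualifier)] = value

def get(row_key, family, qualifier):
    # Missing rows or cells simply return None; no fixed schema is enforced.
    return table.get(row_key, {}).get((family, qualifier))

put("user#42", "info", "name", "Ada")
put("user#42", "metrics", "logins", 7)  # different family, same row
```

Unlike a relational table, two rows need not share any columns, which is what makes the model a good fit for sparse, massive datasets.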
- 4
Spark
Apache Spark is the rising star of the big data ecosystem. The technology was originally developed at the AMPLab at the University of California, Berkeley. It can be used as a faster alternative to Hadoop’s MapReduce because Spark processes data in memory rather than writing intermediate results to disk between stages, producing performance that can be up to 100 times faster, depending upon the application.
Spark’s developers now work at Databricks, which provides major support to the project within the Apache Software Foundation, and also offers a commercial Spark-as-a-Service. As of the end of 2015, Spark was the most active open source project in all of big data, with more than 600 contributors in the previous 12 months.
Many companies are using Spark today, including Amazon, Autodesk, eBay, Groupon, OpenTable, and TripAdvisor.
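The performance idea behind Spark can be shown with a tiny caching sketch. This is conceptual Python, not the PySpark API: once a dataset is materialized in memory, later stages reuse it instead of recomputing (or re-reading from disk), which is where much of Spark's speedup over disk-backed MapReduce comes from:

```python
compute_calls = 0  # counts how often the expensive work actually runs

def expensive_transform(data):
    global compute_calls
    compute_calls += 1
    return [x * x for x in data]

class CachedDataset:
    """Toy stand-in for an in-memory cached dataset (not a Spark RDD)."""
    def __init__(self, source):
        self.source = source
        self._cache = None  # filled on first materialization

    def collect(self):
        if self._cache is None:
            self._cache = expensive_transform(self.source)
        return self._cache

ds = CachedDataset(range(5))
first = ds.collect()   # computes the result once
second = ds.collect()  # served straight from memory
```

In Spark the same effect is achieved by marking a dataset as cached, so iterative algorithms that revisit the same data repeatedly avoid paying the recomputation cost each pass.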
- 5
Kafka
Apache Kafka was originally developed as a project within LinkedIn as a messaging system for brokering the massive quantity of real-time data generated and processed by the company’s consumer-facing careers website and platform.
Kafka was donated to open source in 2011 and graduated from the Apache Incubator program in 2012. The LinkedIn developers who created Kafka became part of a new company spun out of LinkedIn called Confluent.
Kafka is used by LinkedIn, Twitter, Netflix, Pinterest, Goldman Sachs, and Coursera.
(Image: The Apache Software Foundation)
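Kafka's brokering model can be sketched as an append-only log per topic, with each consumer tracking its own read offset. This toy Python class illustrates the idea only; it is not the Kafka client API:

```python
from collections import defaultdict

class Broker:
    """Toy sketch of a Kafka-style broker: one append-only log per topic."""
    def __init__(self):
        self.logs = defaultdict(list)

    def produce(self, topic, message):
        # Producers only ever append; messages are never updated in place.
        self.logs[topic].append(message)

    def consume(self, topic, offset):
        # Each consumer remembers its own offset, so many consumers can
        # read the same stream independently and at their own pace.
        return self.logs[topic][offset:]

broker = Broker()
broker.produce("page-views", "alice viewed /jobs")
broker.produce("page-views", "bob viewed /home")

slow_reader = broker.consume("page-views", 0)  # sees both messages
caught_up = broker.consume("page-views", 2)    # nothing new yet
```

The topic name and messages are invented for illustration. Decoupling producers from consumers this way is what lets one real-time feed fan out to many downstream systems.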
- 6
Storm
Apache Storm is described on its project page as a distributed real-time computation system that makes it easy to reliably process unbounded streams of data, doing for real-time processing what Hadoop did for batch processing.
The technology is sometimes described as an alternative to Spark. BackType, the company that developed Storm, was acquired by Twitter in 2011. Storm became a Top-Level Project at the Apache Software Foundation in 2014 after graduating from the Incubator.
Twitter has since developed its own in-house system for handling the tasks originally assigned to Storm. Companies using Storm include Yahoo and Spotify.
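Storm's topology model, where spouts emit an unbounded stream of tuples and bolts process them one at a time, can be sketched with Python generators. This is a conceptual illustration, not the Storm API:

```python
def spout(events):
    # In Storm a spout reads a live source (a queue, an API feed);
    # a plain generator stands in here.
    for event in events:
        yield event

def split_bolt(stream):
    # A bolt that transforms each sentence tuple into word tuples.
    for sentence in stream:
        for word in sentence.split():
            yield word

def count_bolt(stream, counts):
    # A bolt that updates running counts as each tuple arrives,
    # rather than waiting for a complete batch.
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
    return counts

counts = count_bolt(split_bolt(spout(["storm streams data", "data flows"])), {})
```

The key contrast with batch MapReduce is that results here are updated continuously as tuples flow through the topology, instead of being produced once at the end of a job.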
- 7
NiFi
Apache NiFi, originally called Niagara Files, is a technology transfer project developed by the US National Security Agency (NSA) and donated to the Apache Software Foundation as an incubator project in November 2014. It became a Top-Level Project in 2015.
NiFi tackles the problem of how to automate the flow of data between systems. Its project page at the Apache Software Foundation says the technology “supports powerful scalable directed graphs of data routing, transformation, and system mediation logic.”
It provides a Web-based user interface. And, as you might expect from an NSA-created project, it offers security features including SSL, SSH, HTTPS, encrypted content, and pluggable, role-based authentication and authorization.
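The dataflow idea, records moving through a directed chain of processors that route and transform them, can be sketched in plain Python. This is a conceptual illustration of the model, not the NiFi API, and the processor names are invented for the example:

```python
def run_flow(records, processors):
    # Each record passes through the processors in order. A step may
    # route a record away by returning None, or transform it by
    # returning a new value.
    out = []
    for record in records:
        for step in processors:
            record = step(record)
            if record is None:
                break  # record routed out of the flow
        else:
            out.append(record)
    return out

drop_empty = lambda r: r if r.strip() else None            # routing step
normalize = lambda r: r.strip().lower()                    # transformation
tag_source = lambda r: {"source": "sensor-a", "value": r}  # enrichment

flow = run_flow(["  TEMP:21 ", "", "TEMP:19"], [drop_empty, normalize, tag_source])
```

In NiFi, such chains are drawn in the web UI as a directed graph, and the framework additionally handles buffering, back pressure, and provenance tracking between the steps.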
- 8
Flink
The Apache Software Foundation accepted Apache Flink as a Top-Level Project in January 2015. The technology is a distributed data analysis engine for batch and streaming data that offers programming APIs in Java and Scala.
The project was born out of the Stratosphere research project in Berlin. Organizations using Flink include Capital One and Data Artisans.
- 9
Arrow
Apache Arrow was accepted as a Top-Level Project by the Apache Software Foundation this month. The technology comes out of the company Dremio, which has also contributed the Apache Drill project. Dremio’s founders came out of MapR, an Apache Hadoop distribution company.
Arrow was initially seeded by code from the Apache Drill project, according to the Apache Software Foundation. Arrow provides columnar in-memory analytics, according to Dremio co-founder and CTO Jacques Nadeau.
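Columnar in-memory layout, the idea at the heart of Arrow, can be contrasted with row-wise storage in a few lines of stdlib Python. This is a conceptual sketch, not the Arrow API:

```python
from array import array

# Row-wise layout: one record object per row, fields interleaved in memory.
rows = [{"id": 1, "price": 9.5}, {"id": 2, "price": 3.0}, {"id": 3, "price": 7.5}]

# Columnar layout: one contiguous, typed buffer per field.
ids = array("q", [1, 2, 3])           # 64-bit integers, packed together
prices = array("d", [9.5, 3.0, 7.5])  # 64-bit floats, packed together

# An analytic scan over one field touches only that field's buffer.
row_total = sum(r["price"] for r in rows)
col_total = sum(prices)
```

Both layouts give the same answer, but the columnar scan reads one contiguous buffer, which is friendlier to CPU caches and vectorized execution, and, in Arrow's case, gives different systems a common in-memory format to share without conversion.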
- 10
More Big Data Projects At ASF
These are some of the highlights of the big data projects in the Hadoop ecosystem at the Apache Software Foundation. Many others have been donated. Development is ongoing for all these projects, which are fully documented at the Apache Software Foundation website.
“The Apache Way is community over code,” Connolly told InformationWeek. “While technology is interesting, the Apache Way is about the community first. You check your [company’s] badge at the door.”