Teradata and Hortonworks Partnership and What It Means
Teradata sells software, hardware, and services for data warehouses and analytic applications. The Teradata portfolio also includes the Teradata Aster MapReduce Platform, a massively parallel processing infrastructure with a software solution that embeds both SQL and MapReduce analytic processing for deeper analytic insights on multi-structured data and new analytic capabilities driven by data science.
Hortonworks offers services around the 100% Apache-licensed, open source Hortonworks Data Platform, an integrated solution built around Hadoop.

The interesting bits from the announcement and media coverage:
Teradata and Hortonworks will join forces to provide technologies and strategic guidance to help businesses build integrated, transparent, enterprise-class big data analytic solutions that leverage Apache Hadoop. The partnership will focus on enabling businesses to use Apache Hadoop to harness the value from new sources of data. Businesses will be able to quickly load and refine multi-structured data, some of which is being discarded today, for discovery and analytics. The resulting insights will enable analysts and front line users to make the best business decision possible.

For example, each day websites generate many terabytes of raw, complex data about customers’ viewing and buying habits. These web logs can be directly loaded into Teradata Aster or Apache Hadoop where they can be stored, transformed, and refined in preparation for analysis by the Teradata Aster MapReduce platform (nb: my emphasis).
The company [Teradata] has already worked with Hortonworks’ competitor Cloudera on a connector between the Teradata Database and Cloudera’s Hadoop distribution, but the Hortonworks deal appears a little deeper and more strategic.
The alliance between Teradata and Hortonworks means that companies can get strategic advice about how to get into the new analytics game from Teradata, and have practical help on running the systems from Hortonworks.
However, there are two important challenges that need to be addressed before broad enterprise adoption can occur:
- Understanding the right use cases in which to utilize Apache Hadoop.
- Integrating Apache Hadoop with existing data architectures in an appropriate manner to get better value from existing investments.
My sense of excitement about the Teradata/Hortonworks partnership is amplified by the fact that it addresses these two core challenges for Apache Hadoop:
- We will be rolling out a reference architecture that provides guidance to enterprises that want to understand the best use cases for which to apply Hadoop. As part of that, we will be helping Teradata customers use Hadoop in conjunction with their Teradata and Teradata Aster analytic data solutions investments.
- We will also be working closely with the Teradata engineering teams on jointly engineered solutions that optimize the integration points with Apache Hadoop.
- From Hortonworks’s perspective, this deal is weaker than the Oracle-Cloudera deal. In the former case, new Teradata sales do not necessarily result in new Hortonworks Data Platform installations, while in the case of the Oracle-Cloudera partnership, every sale results in new business for Cloudera.
- From Teradata’s perspective, this partnership gives them a perfect answer and solution for clients asking about unstructured data scenarios.
- The announcement leans toward positioning Hadoop as part of the ETL process, but it is not as strict about this as other Hadoop integration architectures; see Netezza and Hadoop and Vertica and Hadoop.
- Depending on the level of integration the two teams pull off, this partnership might result in one of the most complete and powerful structured and unstructured data warehouse and analytics platforms.
I’m looking forward to seeing the proposed architecture blueprint once it’s finalized.
- teradata.com: Teradata-Hortonworks Partnership to Accelerate Business Value from Big Data Technologies
- hortonworks.com: The Importance of the Teradata & Hortonworks Partnership
- The Data Blog: Aster Data Blog » Blog Archive » Perspectives on Teradata-Hortonworks Partnership
- Bits NYTimes.com: Teradata and Hortonworks Join Forces for a Big Data Boost
- GigaOM: Teradata taps Hortonworks to improve Hadoop story
- ServicesANGLE: Hortonworks Announces Partnership with Teradata
Original title and link: Teradata and Hortonworks Partnership and What It Means (NoSQL database©myNoSQL)
Quick Guide to MongoDB and Python With PyMongo
A tutorial on PyMongo from Rick Copeland covering:
- configuration options for MongoDB
- documents structure, inserts and batch inserts
- querying and indexing
- deleting
- updating
One thing that’s nice about the pymongo connection is that it’s automatically pooled. What this means is that pymongo maintains a pool of connections to the mongodb server that it reuses over the lifetime of your application. This is good for performance since it means pymongo doesn’t need to go through the overhead of establishing a connection each time it does an operation. Mostly, this happens automatically. You do, however, need to be aware of the connection pooling, since you need to manually notify pymongo that you’re “done” with a connection in the pool so it can be reused.
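To make the pooling behavior described above concrete, here is a minimal sketch in plain Python. It is not PyMongo’s implementation: `ToyPool`, `checkout`, and `checkin` are hypothetical names, and the "connections" are just numbered placeholder strings. It only illustrates why a connection has to be handed back before it can be reused.

```python
import queue


class ToyPool:
    """A toy fixed-size connection pool (hypothetical, not PyMongo's code)."""

    def __init__(self, size):
        self._idle = queue.Queue()
        self._size = size
        self.created = 0

    def checkout(self):
        """Reuse an idle connection if one exists, otherwise open a new one."""
        try:
            return self._idle.get_nowait()
        except queue.Empty:
            if self.created >= self._size:
                raise RuntimeError("pool exhausted: no connection was returned")
            self.created += 1
            return "conn-%d" % self.created

    def checkin(self, conn):
        """Tell the pool we are done, so the connection can be reused."""
        self._idle.put(conn)


pool = ToyPool(size=2)
first = pool.checkout()
pool.checkin(first)        # hand the connection back once the operation is done
second = pool.checkout()   # the idle connection is reused; nothing new is opened
print(first, second, pool.created)
```

If the `checkin` call is skipped, every `checkout` opens a fresh connection until the pool runs out, which is the cost the quoted paragraph warns about.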
The Navy Big Data Story: From Netcentricity to Every Platform as a Sensor
One of the Navy’s first steps toward the information dominance idea was to start thinking of all of its ships, airplanes and other platforms as data-collecting sensors.
“So we’ve gone from netcentricity, to every platform as a sensor, to netting every sensor, and the point of all of it really is not just to have netcentricity but to enable good decisions based on the data that comes off of those sensors,” said Rear Adm. Jan Tighe.
A captivating story of freeing data from silos and attempting to gain insights from it in the new form.
Objectivity CEO: We Have Been Solving the Big Data Problem
Jay Jarrell, the President and CEO of Objectivity, in a PR announcement:
We have been solving the Big Data problem for decades.
For decades!
One Side of the Hadoop Adoption Story
Ron Bodkin [1] in the Hadoop and NoSQL in a Big Data Environment interview:
But then the next thing that happens is once people have started doing that level of processing they realize there is a power of being able to ask questions they never thought of before the data, they can store all the data in small samples and they can go back and have a powerful query engine, a cluster of commodity machines that lets them dig into that raw data and analyze it new ways ultimately leading to data science being able to do machine learning and being able to discover patterns in data and keep them improving and refining the data.
Arun Murthy [2] quoted in PC Advisor’s Making the transition from RDBMS to Hadoop:
“In the bottom-up method of deployment, usually there’s a couple of engineers who download and deploy Hadoop either on a single node or maybe a small cluster with four or five nodes,” Murthy explained.
What tends to happen next is a pattern that Murthy has seen many times. Staffers using the Hadoop cluster start to notice the value of the toolset. Perhaps other divisions of the company set up their own Hadoop clusters. Eventually, the value of Hadoop rises significantly and (thanks to the scalability of the underlying distributed filesystem), the separate Hadoop clusters are combined into a single large cluster with perhaps 50 or so nodes.
[1] Ron Bodkin: Founder and CEO of Think Big Analytics
[2] Arun C. Murthy: Founder and Architect at Hortonworks, Hadoop PMC
InfiniteGraph 2.1 Features Gremlin Support and a Plugin Framework
A new version of InfiniteGraph, the graph database from Objectivity, was announced today. This release features:
- a plugin framework: Two kinds of plugins are supported. A navigator plugin bundles components that assist in navigation queries, such as result qualifiers, path qualifiers, and guides. The Formatter plugin formats and outputs results of graph queries.
- enhanced IG Visualizer: The advanced Visualizer is now tightly integrated with InfiniteGraph’s Plugin Framework allowing indexing queries for edges, the Formatter plugin framework export GraphML and JSON (built-in) or other user defined plugin formats.
- support for Tinkerpop Blueprints and Gremlin: InfiniteGraph provides a clean integration with Blueprints that is well suited for applications that want to traverse and query graph databases using Gremlin
A few more details can be found in the InfiniteGraph 2.1 release notes.
Riak Performance of Link Walking vs MapReduce
If you are asked to compare (or you just wonder about) the performance of link walking and MapReduce in Riak, keep in mind the following details of how the two mechanisms are implemented:
The biggest difference I see is that the link-walk uses an Erlang function where your MapReduce query uses a Javascript function (link-walking is implemented as a MapReduce query internally).
Serializing/deserializing to JSON as well as contention for Javascript VMs likely accounts for the lost time.
My emphasis on Bryan Fink’s email from Riak’s mailing list.
An Introduction to Scalding, the Scala and Cascading MapReduce Framework From Twitter
A fantastic guide to Twitter’s Scala and Cascading MapReduce framework Scalding from Edwin Chen [1]:
In 140: instead of forcing you to write raw map and reduce functions, Scalding allows you to write natural code like
// Create a histogram of tweet lengths.
tweets.map('tweet -> 'length) { tweet : String => tweet.size }.groupBy('length) { _.size }
Looking at the code samples, this looks a lot like Apache Pig. But the Scalding documentation compares it to Scrunch/Scoobi and points to the answers in this Quora thread:
The main difference between Scalding (and Cascading) and Scrunch/Scoobi is that Cascading has a record model where each element in your distributed list/table is a table with some named fields. This is nice because most common cases are to have a few primitive columns (ints, strings, etc…).
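For readers who don’t write Scala, the same histogram can be sketched in plain Python. This runs in a single process on a made-up list of tweets, so it only shows what the Scalding one-liner computes, not how Cascading distributes the work across a cluster.

```python
from collections import Counter

# Hypothetical sample data standing in for the 'tweet field.
tweets = ["hello", "big data", "scala", "hadoop!", "pig"]

# "map" step: tweet -> length.
lengths = (len(tweet) for tweet in tweets)

# "groupBy + size" step: count how many tweets share each length.
histogram = Counter(lengths)

for length, count in sorted(histogram.items()):
    print(length, count)
```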
[1] Edwin Chen is a data scientist at Twitter
Quick Guide to Using MongoDB With Django
In this article, learn how to call MongoDB from Python (using MongoEngine), and integrate it into a Django project in lieu of the built-in ORM. A sample web interface for creating, reading, writing, and updating data to the MongoDB back end is included.
When I think of Django (and Rails), one of its major strengths is the ease of defining mappings for the application entities and the way these mappings integrate with both the persistence layer and the presentation layer. But the document model introduces many options and approaches that the relational model (the default in Django) doesn’t have. Support for simple mappings is a good start, but I don’t think the MongoDB integration yet supports optional fields, entity versioning, custom aggregate entity mappings, or tunable queries.
Amazon DynamoDB Is Not Production Ready
Timothy Cardenas reports on his experience with Amazon DynamoDB and the Ruby SDK:
Problems I have had include:
- Write capacity hanging in create mode for over an hour
- Inability to simply count my records
- Inability to loop through records without huge read costs
- No asynchronous support for writing
- Can only double read/write capacity per update
- Ruby SDK is written like a labyrinth with very little ability to extend without knowing every little detail about the rest of the library. I couldn’t even understand how a request was created, it was so convoluted.
Basically with the ruby client you can put data in but can’t get it out efficiently without paying a ton for beefed up read operations.
I think that only the lack of support for async writes and the complexity of the Ruby SDK are really Amazon DynamoDB related issues; I assume the first one has been a temporary issue. Everything else is DynamoDB’s documented behavior and so one is supposed to be aware of these when designing their applications.
As far as I know, Amazon DynamoDB has been in private beta for a while with real production users. That doesn’t mean DynamoDB will be the right solution for everyone. But not being the right fit for a given application is not the same as not being production ready.
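The "can only double capacity per update" limit is a good example of documented behavior that is easy to plan around. The sketch below assumes the limit works exactly as quoted above (each update can at most double the current provisioned capacity); the function name and the sample numbers are made up for illustration.

```python
def updates_needed(current, target):
    """Count capacity updates needed if each update can at most double it."""
    if current <= 0:
        raise ValueError("current capacity must be positive")
    steps = 0
    while current < target:
        current *= 2
        steps += 1
    return steps


# Going from 10 to 1,000 write units: 10 -> 20 -> 40 -> ... -> 1280.
print(updates_needed(10, 1000))
```

Knowing the number of steps up front makes it possible to start scaling before a traffic spike rather than during it.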
Characteristics of Machine Learning Models
Ricky Ho published yet another great article giving a high level summary of the algorithms used by different machine learning models:
- decision trees
- linear regression methods
- neural networks
- bayesian networks
- support vector machines
- nearest neighbor
For classification and regression problems, there are different choices of Machine Learning Models, each of which can be viewed as a blackbox that solves the same problem. However, each model comes from a different algorithmic approach and will perform differently on different data sets. The best way is to use cross-validation to determine which model performs best on test data.
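The cross-validation advice in the quote can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not code from Ricky Ho’s article: the two "models" (a mean baseline and 1-nearest-neighbor), the contiguous fold split, and the toy data set are all made up here.

```python
def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def mean_model(train):
    """Baseline: always predict the mean of the training targets."""
    avg = sum(y for _, y in train) / len(train)
    return lambda x: avg


def nearest_neighbor_model(train):
    """1-nearest-neighbor: predict the target of the closest training point."""
    return lambda x: min(train, key=lambda p: abs(p[0] - x))[1]


def cross_val_mse(make_model, data, k=5):
    """Mean squared error over all held-out points across k folds."""
    errors = []
    for fold in k_fold_indices(len(data), k):
        held_out = set(fold)
        train = [p for i, p in enumerate(data) if i not in held_out]
        model = make_model(train)
        errors.extend((model(data[i][0]) - data[i][1]) ** 2 for i in fold)
    return sum(errors) / len(errors)


# Made-up data following y = 2x.
data = [(x, 2 * x) for x in range(10)]
scores = {
    "mean": cross_val_mse(mean_model, data),
    "nearest neighbor": cross_val_mse(nearest_neighbor_model, data),
}
best = min(scores, key=scores.get)
print(scores, best)
```

Each model is treated as a black box behind the same `make_model(train)` interface, which is what lets the same held-out error be compared across very different algorithms.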
Data Scientists Are Hot
Based on a couple of searches on job sites and an email from a headhunter, GigaOM’s Barb Darrow concludes that data scientists are in high demand these days:
My client is one of the largest professional services firms in the world and they are looking for very senior data analytics experts who can apply his/her advanced analytics, predictive modeling, and data visualization skills to the fraud/dispute arena. Exceptional compensation packages are available in the $300,000 to $500,000 range for the appropriate technical and leadership experience.
There’s no denying that data scientists are hot, and Darrow is not the first one writing about it. Hal Varian, Chief Economist at Google, said many years ago: “I keep saying that the sexy job in the next 10 years will be statisticians”. Many others have already agreed that the future belongs to the companies and people that turn data into products. And I remember reading recently about some reports mentioning 150,000 to 200,000 jobs in this market in the next couple of years.
On the other hand though, there are various myths about data scientists’ role. Job descriptions will mention many years of experience with Hadoop and Big Data. But even if there are some hints about what makes a good data scientist and how to hire the right data geeks, there’s no alignment on what data science is and what is involved in the role of the data scientist.
This still feels like the early days, when requirements and expectations change overnight. But these are also the days when most of those involved are having a lot of fun learning, discovering new ways to deal with data, and defining tomorrow.
Monitoring MongoDB With Monitis
Three common issues and how Monitis can help:
- Too many connections: Open connections consume resources. Too many open connections can take down your entire instance even if they are not running any queries.
- More data than RAM: Keeping an eye on the “virtual memory” is the best way to gauge the performance of your Mongo instances.
- Timeouts: With a custom Monitis Monitor reading the HTTP Console you can find out if your database is throwing timeouts.
Monitis is for those running their own monitoring infrastructure. But if you want a hosted monitoring solution there are a couple of options available too: 10gen’s MongoDB Monitoring Service, Boxed Ice Server Density, and probably more that I’m forgetting right now.
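Whatever tool does the monitoring, the first two checks in the list above boil down to reading a couple of numbers and comparing them to thresholds. The sketch below assumes a dict shaped like MongoDB’s serverStatus output (`connections.current`, `connections.available`, and `mem.virtual` in megabytes). The function name, the 80% threshold, the "virtual memory above physical RAM" heuristic, and the sample values are all assumptions made for illustration, not Monitis behavior.

```python
def check_server_status(status, ram_mb, max_conn_ratio=0.8):
    """Return warnings for a dict shaped like MongoDB's serverStatus output."""
    warnings = []

    conns = status["connections"]
    total = conns["current"] + conns["available"]
    if total and conns["current"] / total > max_conn_ratio:
        warnings.append("too many open connections")

    # mem.virtual is reported in megabytes; this comparison is a crude heuristic.
    if status["mem"]["virtual"] > ram_mb:
        warnings.append("virtual memory exceeds physical RAM")

    return warnings


# Made-up sample; in a real setup this dict might come from
# pymongo's db.command("serverStatus").
sample = {
    "connections": {"current": 900, "available": 100},
    "mem": {"virtual": 20480, "resident": 7000},
}
print(check_server_status(sample, ram_mb=16384))
```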
Lightning Talk on Cascalog
Just 19 slides, but Paul Lam manages to provide both a comparison of Cascalog and Hive, plus an overview of the most interesting bits of Cascalog.
Highly recommended for understanding what’s in the Cascalog box.
Neo4J Spatial and Gephi for Smart Data Analysis
As I often run the same course, it would be interesting to calculate my average pace at specific locations. When combining the data of all of my courses, I could deduct frequently encountered locations. Finally, could there be a correlation between my average pace and my distance from home? In order to come up with answers to these questions, I will import my running data into a Neo4J Spatial datastore. Neo4J Spatial extends the Neo4J Graph Database with the necessary tools and utilities to store and query spatial data in your graph models. For visualizing my running data, I will make use of Gephi, an open-source visualization and manipulation tool that allows users to interactively browse and explore graphs.
This looks like a great application of a graph database for analyzing geo data. And it’s very practical.
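The last question in the quote (is pace correlated with distance from home?) can be sketched without any graph database at all. The snippet below is only an illustration in plain Python: the home coordinates and the pace samples are invented, and it uses the standard haversine formula and Pearson correlation rather than anything from Neo4J Spatial or Gephi.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometers."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


home = (51.0, 4.0)  # hypothetical home location
# Hypothetical samples: (lat, lon, pace in minutes per km).
samples = [
    (51.00, 4.01, 5.0),
    (51.00, 4.02, 5.2),
    (51.00, 4.03, 5.4),
    (51.00, 4.04, 5.6),
]

distances = [haversine_km(home[0], home[1], lat, lon) for lat, lon, _ in samples]
paces = [pace for _, _, pace in samples]
print(round(pearson(distances, paces), 3))
```

A spatial store earns its keep once the questions become "which of my runs passed within 50 meters of this point", which a flat list like this answers only by scanning everything.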
Dealing With JVM Limitations in Apache Cassandra
A couple of the most notable NoSQL databases targeting large, scalable systems are written in Java: Cassandra and HBase. Then there’s also Hadoop. Plus a series of caching and data grid solutions like Terracotta and GigaSpaces. They all face the same challenge: tuning the JVM garbage collector for predictable latency and throughput.
Jonathan Ellis’s slides presented at FOSDEM 2012 cover some of the problems with GC and the way Cassandra tackles them. While this is one of those presentations where the slides are not enough to understand the full picture, going through them will still give you a couple of good hints.
For those saying that Java and the JVM are not the platform for writing large concurrent systems, here’s the quote Ellis ends his slides with:
Cliff Click: Many concurrent algorithms are very easy to write with a GC and totally hard (to down right impossible) using explicit free.
Enjoy the slides after the break.
Website Statistics Using Node.js and Redis
A particularly good fit are the fast, atomically incremented counters, which can be used in multiple dimensions like: hits-by-app, hits-by-url, hits-by-day, plus combinations. The following gist gives a good idea:
When working on similar analytics solutions, keep in mind that data gathering is only one side of the problem. The other is reporting. And the way you store collected data has a huge impact on what types of reporting you’ll be able to pull.
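To show how the storage layout decides what can be reported later, here is a small Python sketch of multi-dimensional hit counters. It is not the gist from the original article: the key format, the function names, and the sample hits are made up, and `FakeRedis` is an in-memory stand-in so the example runs without a server. A real redis-py client exposes an `incr` method that could be passed in its place.

```python
from itertools import combinations


def counter_keys(hit):
    """Build one counter key per dimension and per combination of dimensions."""
    dims = sorted(hit.items())
    keys = []
    for size in range(1, len(dims) + 1):
        for combo in combinations(dims, size):
            parts = ["%s=%s" % (name, value) for name, value in combo]
            keys.append("hits:" + ":".join(parts))
    return keys


def record_hit(client, hit):
    """Increment every counter for this hit; client needs an incr(key) method."""
    for key in counter_keys(hit):
        client.incr(key)


class FakeRedis:
    """In-memory stand-in so the sketch runs without a Redis server."""

    def __init__(self):
        self.data = {}

    def incr(self, key):
        self.data[key] = self.data.get(key, 0) + 1
        return self.data[key]


client = FakeRedis()
record_hit(client, {"app": "blog", "day": "2012-02-20"})  # made-up hits
record_hit(client, {"app": "blog", "day": "2012-02-21"})
print(client.data["hits:app=blog"], client.data["hits:app=blog:day=2012-02-21"])
```

Only the combinations that were counted at write time can be read back cheaply, so a report by URL and day is impossible here unless `url` was one of the recorded dimensions.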
RavenDB in the Cloud: CloudBird
I missed mentioning the private beta RavenDB hosting service CloudBird in the third wave of hosted and managed NoSQL services. For now, I don’t have any other details about their services. Just an email registration form.
A Question About NoSQL Managed Hosting
It’s impossible to always have the right answers to all the questions. So this time I’ll have to ask you all: why are only some NoSQL databases present in managed hosting offers?
The first wave of NoSQL managed hosting services brought MongoDB, CouchDB, and some Redis. The second wave brought some more MongoDB, CouchDB, and just a bit more of Redis. It was only the third wave that brought some managed services for graph databases: Neo4j and OrientDB. Plus the first proposal for Cassandra managed hosting.
The first answer that comes to mind when thinking about NoSQL managed services is adoption. If a product is not in wide use then the chances for a company to run a profitable hosting business are very low. But I have the feeling that this is not the only or the complete answer.
Please chime in and share your thoughts.
Cassandra at Clearspring with Chris Burroughs - Powered by NoSQL
For today’s Powered by Cassandra video from the Cassandra NYC 2011 event organized by DataStax, I chose Chris Burroughs’s presentation about Clearspring’s usage of Cassandra. Just in case you wonder what Clearspring is doing, the sharing buttons you see here on myNoSQL are powered by the AddThis product from Clearspring.
Chris Burroughs talks about how Clearspring started using Cassandra and the lessons they’ve learned.
To watch more videos from this event, follow the Cassandra NYC 2011 tag.