Send me a patch for that
This post is in reply to Hadi’s post. Please go ahead and read it.
Done? Great, so let me try to respond, this time, from the point of view of someone who regularly asks for patches / pull requests.
To make things more interesting, the project that I am talking about now is RavenDB which is both Open Source and commercial. Hadi says:
Numerous times I’ve seen reactions from OSS developers, contributors or merely a simple passer by, responding to a complaint with: submit a patch or well if you can do better, write your own framework. In other words, put up or shut up.
Hadi then goes on to explain exactly why this is a high barrier for most users.
- You need to familiarize yourself with the codebase.
- You need to understand the source control system that is used and how to send a patch / pull request.
And I would fully agree with Hadi that those are stumbling blocks. I can’t speak for other people, but in our case, that is the intention.
Nitpicker corner here: I am speaking explicitly and only about features here. Bugs get fixed by us (unless the user already submitted a fix as well).
Put simply, there is an issue of priorities here. We have a certain direction in which we want to take the project. And in many cases, users want things that are out of scope for us for the foreseeable future. Our options then become:
- Sorry, ain’t going to happen.
- Sure, we will push aside all the work that we intended to do to do your thing.
- No problem, we added that to the queue, expect it in 6 – 9 months, if we will still consider it important then.
None of which is an acceptable answer from our point of view.
Case in point: facets support in RavenDB was something that was requested a few times. We never did it because it was out of scope for our plan. RavenDB is a database server, not a search server, and we weren’t really sure how complex this would be or how to implement it. Basically, this was an expensive feature that wasn’t in the major feature set that we wanted. The answer that we gave people was “send me a pull request for that”.
To be clear, this is basically an opportunity to affect the direction of the project in a way you consider important. What ended up happening is that Matt Warren took up the task and created an initial implementation, which was then subject to intense refactoring and finally got into the product. You can see the entire conversation about this here. The major difference along the way is that Matt did all the research for this feature, and he had working code. From there the balance changed. It was no longer an issue of expensive research and figuring out how to do it. It was an issue of having working code and refactoring it so it matched the rest of the RavenDB codebase. That wasn’t expensive, and we got a new feature in.
Here is another story, a case where I flat out didn’t think it was possible. About two years ago Rob Ashton had a feature suggestion (ad hoc queries with RavenDB). Frankly, I thought that this was simply not possible, and after a bit of back and forth, I told Rob:
Let me rephrase that.
Dream up the API from the client side to do this.
Rob went away for a few hours, and then came back with a working code sample. I had to pick my jaw off the floor using both hands. That feature got a lot of priority right away, and is a feature that I routinely brag about when talking about RavenDB.
But let me come back again to the common case: a user requests something that isn’t in the project plan. Now, remember, requests are cheap. From the point of view of the user, it doesn’t cost anything to request a feature. From the point of view of the project, it can cost a lot. There is research, implementation, debugging, backward compatibility, testing and continuous support associated with just about any feature you care to name.
And our options whenever a user makes a request that is out of line with the project plan are:
- Sorry, ain’t going to happen.
- Sure, we will push aside all the work that we intended to do to do your thing.
- No problem, we added that to the queue, expect it in 6 – 9 months, if we will still consider it important then.
Or, we can also say:
- We don’t have the resources to currently do that, but we would gladly accept a pull request to do so.
At that point, the user is faced with a choice. He can either say:
- Oh, well, it isn’t important to me.
- Oh, it is important to me, so I had better do that.
In other words, it shifts the prioritization to the user, based on how important that feature is.
We recently got a feature request to support something like this:
session.Query<User>()
    .Where(x => searchInput.Name != null && x.User == searchInput.Name)
    .ToArray();
I’ll spare you the details of just how complex it is to implement something like that (especially when it can also be things like searchInput.Age > 18). But the simple workaround for that is:
var q = session.Query<User>();
if (searchInput.Name != null)
    q = q.Where(x => x.User == searchInput.Name);
q.ToArray();
Supporting the first one is complex, and there is a simple workaround that the user can use (I also like the second option better from the point of view of readability).
That sort of thing gets an “A pull request for this feature would be appreciated”, because the alternative to that is to slam the door in the user’s face.
A new DB2 clock has started: End of Service for DB2 9
Limit your abstractions: Application Events–Proposed Solution #1
In my previous post, I explained why I really don’t like the following.
public class CargoInspectionServiceImpl : ICargoInspectionService
{
// code redacted for simplicity
public override void InspectCargo(TrackingId trackingId)
{
Validate.NotNull(trackingId, "Tracking ID is required");
Cargo cargo = cargoRepository.Find(trackingId);
if (cargo == null)
{
logger.Warn("Can't inspect non-existing cargo " + trackingId);
return;
}
HandlingHistory handlingHistory = handlingEventRepository.LookupHandlingHistoryOfCargo(trackingId);
cargo.DeriveDeliveryProgress(handlingHistory);
if (cargo.Delivery.Misdirected)
{
applicationEvents.CargoWasMisdirected(cargo);
}
if (cargo.Delivery.UnloadedAtDestination)
{
applicationEvents.CargoHasArrived(cargo);
}
cargoRepository.Store(cargo);
}
}
Now, let us see one proposed solution for that. We can drop the IApplicationEvents.CargoHasArrived and IApplicationEvents.CargoWasMisdirected, instead creating the following:
public interface IHappenOnCargoInspection
{
void Inspect(Cargo cargo);
}
We can have multiple implementations of this interface, such as this one:
public class MisdirectedCargo : IHappenOnCargoInspection
{
public void Inspect(Cargo cargo)
{
if(cargo.Delivery.Misdirected == false)
return;
// code to handle misdirected cargo.
}
}
In a similar fashion, we would have a CargoArrived implementation, and the ICargoInspectionService would be tasked with managing the implementations of IHappenOnCargoInspection, probably through a container. Although I would probably replace it with something like:
container.ExecuteOnAll<IHappenOnCargoInspection>(i => i.Inspect(cargo));
All in all, it is a simple method, but it means that now the responsibility to detect and act is centralized in each cargo inspector implementation. If the detection of misdirected cargo is changed, we know that there is just one place to make that change. If we need a new behavior, for example, for late cargo, we can do that by introducing a new class which implements the interface. That gives us the Open Closed Principle.
This is better, but I still don’t like it. There are better methods than that, but we will discuss them in another post.
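For readers who want to see the shape of this proposal run end to end, here is a rough sketch. It is written in Python rather than C# so that it can stand on its own, and every name in it (`Delivery`, `Cargo`, `CargoArrived`, `execute_on_all`, `inspect_cargo`, the `notifications` list) is a stand-in invented for illustration. In particular, `execute_on_all` only imitates what a container call such as `ExecuteOnAll` might do; it is not any real container's API.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Protocol


@dataclass
class Delivery:
    misdirected: bool = False
    unloaded_at_destination: bool = False


@dataclass
class Cargo:
    tracking_id: str
    delivery: Delivery = field(default_factory=Delivery)


class HappensOnCargoInspection(Protocol):
    """Counterpart of IHappenOnCargoInspection: one method, one cargo."""

    def inspect(self, cargo: Cargo) -> None: ...


class MisdirectedCargo:
    """Detects misdirected cargo and acts on it, all in one place."""

    def __init__(self, notifications: List[str]) -> None:
        self.notifications = notifications

    def inspect(self, cargo: Cargo) -> None:
        if not cargo.delivery.misdirected:
            return
        # Hypothetical handling: just record that something happened.
        self.notifications.append(f"misdirected: {cargo.tracking_id}")


class CargoArrived:
    """Detects cargo unloaded at its destination and acts on it."""

    def __init__(self, notifications: List[str]) -> None:
        self.notifications = notifications

    def inspect(self, cargo: Cargo) -> None:
        if not cargo.delivery.unloaded_at_destination:
            return
        self.notifications.append(f"arrived: {cargo.tracking_id}")


def execute_on_all(
    handlers: Iterable[HappensOnCargoInspection],
    action: Callable[[HappensOnCargoInspection], None],
) -> None:
    """Stand-in for a container that resolves and runs every handler."""
    for handler in handlers:
        action(handler)


def inspect_cargo(cargo: Cargo, handlers: Iterable[HappensOnCargoInspection]) -> None:
    # The service no longer decides which event applies; each handler does.
    execute_on_all(handlers, lambda handler: handler.inspect(cargo))
```

Adding a behavior for late cargo would mean writing one more class with an `inspect` method and registering it in the handler list; `inspect_cargo` itself would not change.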
The question: Would You Like More WLM Information In DB2 Accounting Trace
Announcing Percona Toolkit Release 2.0.3
LevelDB: SSTable and Log Structured Storage
Ilya Grigorik digs into LevelDB’s SSTable and log structured storage1:
If Protocol Buffers is the lingua franca of individual data record at Google, then the Sorted String Table (SSTable) is one of the most popular outputs for storing, processing, and exchanging datasets. As the name itself implies, an SSTable is a simple abstraction to efficiently store large numbers of key-value pairs while optimizing for high throughput, sequential read/write workloads.
Even if it is not talked about much, LevelDB is making notable contributions to the NoSQL space: the leveled compaction strategy in Cassandra 1.0 is based on LevelDB, and Riak has shipped with LevelDB since 1.0.
1. Make sure you are not missing Writes Performance: B+Tree, LSM Tree, Fractal Tree
Original title and link: LevelDB: SSTable and Log Structured Storage (NoSQL database©myNoSQL)
The Design of 99designs - A Clean Tens of Millions Pageviews Architecture
By pure coincidence, General Chicken just published on High Scalability a bullet point summary of the 99designs architecture I’ve linked and commented on earlier.
99designs: Powered by Amazon RDS, Redis, MongoDB, and Memcached
While the authoritative storage is Amazon RDS, 99designs is using Redis, MongoDB, and Memcached for transient data:
We log errors and statistics to capped collections in MongoDB, providing us with more insight into our system’s performance. Redis captures per-user information about which features are enabled at any given time; it supports our development strategy around dark launches, soft launches and incremental feature rollouts.
It’s also worth noting the nice things they say about using Amazon RDS:
An RDS instance configured to use multiple availability zones provides master-master replication, providing crucial redundancy for our DB layer. This feature has already saved our bacon multiple times: the fail over has been smooth enough that by the time we realised anything was wrong, another master was correctly serving requests. Its rolling backups provide a means of disaster recovery. We load-balance reads across multiple slaves as a means of maintaining performance as the load on our database increases.
Data Grid or NoSQL? What are the common points? The main differences?
A great post by Olivier Mallassi on a topic that comes up very often: how do data grids and NoSQL databases compare?
- Data Grids let you control the way data is stored. They all have a default implementation (Gigaspaces offers RDBMS by default, Gemfire offers file and disk based storage by default...) but in all cases, you can choose the one that fits your needs: do you need to store data, do you need to relieve the existing databases...
- In order to minimize latency, data grids enable you to store data synchronously (write-through) or asynchronously (write-behind) on disk. You can also define overflow strategies. In that case, data is stored in memory up to a threshold, beyond which data is flushed to disk (following algorithms like LRU...). NoSQL solutions have not been designed to provide these features.
- Data grids enable you to develop Event Driven Architectures.
- Querying is maybe the point on which pure NoSQL solutions and data grids are merging.
- Data grids enable near-cache topologies.
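The write-through versus write-behind distinction in the list above can be sketched in a few lines. This is a minimal illustration in plain Python, not the API of any data grid: the class names, the dictionary standing in for the disk store, and the explicit `flush` call (which a real product would typically run in the background) are all assumptions made for the example.

```python
from collections import deque
from typing import Deque, Dict, Tuple


class WriteThroughCache:
    """Every put reaches the backing store before the call returns."""

    def __init__(self, store: Dict[str, str]) -> None:
        self.memory: Dict[str, str] = {}
        self.store = store  # stand-in for disk or a database

    def put(self, key: str, value: str) -> None:
        self.memory[key] = value
        self.store[key] = value  # synchronous write: the caller waits for it


class WriteBehindCache:
    """Puts return after updating memory; the store is updated later."""

    def __init__(self, store: Dict[str, str]) -> None:
        self.memory: Dict[str, str] = {}
        self.store = store
        self.pending: Deque[Tuple[str, str]] = deque()

    def put(self, key: str, value: str) -> None:
        self.memory[key] = value
        self.pending.append((key, value))  # queued, not yet durable

    def flush(self) -> int:
        """Drain queued writes to the store; returns how many were written."""
        flushed = 0
        while self.pending:
            key, value = self.pending.popleft()
            self.store[key] = value
            flushed += 1
        return flushed
```

The trade-off is visible in the code: write-behind keeps the store out of the write path, at the price of a window in which queued writes exist only in memory.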
Taking a step back you’ll notice that there are actually more similarities than differences. While Olivier Mallassi lists the above points as features that prove data grids to be more configurable and so more adaptable, some of these also exist in NoSQL databases, taking different forms:
- pluggable storage backends. Not many of the NoSQL databases have this feature, but Riak and Project Voldemort are offering different solutions that are optimized for specific scenarios.
- replicated and durable writes. Not the same as synchronous vs asynchronous writes, but a different perspective on writes.
- Notification mechanisms. Once again not all of the NoSQL databases support notification mechanisms, but a couple of them offer some interesting approaches:
- CouchDB: _changes feed with filters
- Riak: pre-commit and post-commit hooks
- HBase coprocessors
- Most of the NoSQL databases have local per-node caches.
With these, I’ve probably made things even blurrier. But let me try to draw a line between data grids and NoSQL databases. Data grids are optimized for handling data in memory. Everything that spills over is secondary. On the other hand, NoSQL databases are for storing data. Thus they focus on how they organize data (on disk or in memory) and optimize access to it. Data grids are a processing/architectural model. NoSQL databases are storage solutions.
Redis and Python: Building a Markov-chain IRC bot
Charles Leifer:
As an IRC bot enthusiast and tinkerer, I would like to describe the most enduring and popular bot I’ve written, a markov-chain bot. Markov chains can be used to generate realistic text, and so are great fodder for IRC bots.
Redis acts, in many ways, like a big python dictionary that can store several types of useful data structures. For our purposes, we will use the set data type. The top-level keyspace will contain our “keys”, which will be encoded links in our markov chain. At each key there will be a set of words that have followed the words encoded in the key. To generate “random” messages, we’ll use the “SRANDMEMBER” command, which returns a random member from a set.
- You could use other NoSQL databases for this, but you’d miss Redis’s support for sets and the O(1) SRANDMEMBER
- On the other hand, imagine storing the corpus in a graph database where the nodes would represent the words and the edges would carry two pieces of information: the frequency of the connection in the corpus and the frequency with which the connection has been used to generate output.
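The scheme quoted above can be tried out without a Redis server. The sketch below is not Charles Leifer's code: it keeps the chain in a plain Python dictionary of sets, where adding to a set plays the role of Redis's SADD and picking a random element plays the role of SRANDMEMBER. The function names, the two-word chain length and the separator character are assumptions made for the example.

```python
import random
from collections import defaultdict
from typing import DefaultDict, List, Optional, Set

CHAIN_LENGTH = 2      # how many preceding words form a key (assumed)
SEPARATOR = "\x01"    # joins the words of a key into one string (assumed)

Chain = DefaultDict[str, Set[str]]


def make_key(words: List[str]) -> str:
    """Encode a link of the chain as a single key string."""
    return SEPARATOR.join(words)


def train(chain: Chain, message: str) -> None:
    """Record, for each run of CHAIN_LENGTH words, the word that followed."""
    words = message.split()
    for i in range(len(words) - CHAIN_LENGTH):
        key = make_key(words[i:i + CHAIN_LENGTH])
        chain[key].add(words[i + CHAIN_LENGTH])  # Redis equivalent: SADD


def generate(chain: Chain, seed: List[str], max_words: int = 20,
             rng: Optional[random.Random] = None) -> str:
    """Walk the chain from a seed of CHAIN_LENGTH words."""
    rng = rng or random.Random()
    output = list(seed)
    while len(output) < max_words:
        followers = chain.get(make_key(output[-CHAIN_LENGTH:]))
        if not followers:
            break  # nothing ever followed these words
        # Redis equivalent: SRANDMEMBER
        output.append(rng.choice(sorted(followers)))
    return " ".join(output)


def new_chain() -> Chain:
    return defaultdict(set)
```

Because the followers are stored in sets, a word that follows the same key many times is no more likely to be picked than one that followed it once; that is a property of the set-based design described in the quote, not something this sketch adds.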
Calculating a Graph's Degree Distribution Using R MapReduce over Hadoop
Marko Rodriguez is experimenting with R on Hadoop and one of his exercises is calculating a graph’s degree distribution. I confess I had to use Wikipedia to remind myself of the definition of a node degree:
- The degree of a node in a network (sometimes referred to incorrectly as the connectivity) is the number of connections or edges the node has to other nodes. The degree distribution P(k) of a network is then defined to be the fraction of nodes in the network with degree k.
- The degree distribution is very important in studying both real networks, such as the Internet and social networks, and theoretical networks.
As an imagination exercise think of a graph database that’s actively maintaining an internal degree distribution and uses it to suggest or partition the graph. Would that work?
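For a small graph the definition above fits in a few lines. The sketch below is plain single-machine Python, not Marko Rodriguez's R and Hadoop code, although its two counting passes mirror the two group-and-count steps a MapReduce version would likely need. The function name and the edge-list input format are assumptions made for the example.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, Tuple


def degree_distribution(
    edges: Iterable[Tuple[Hashable, Hashable]]
) -> Dict[int, float]:
    """Return P(k): the fraction of nodes that have degree k.

    Edges are treated as undirected. Nodes that appear in no edge are
    invisible to an edge list, so degree 0 is never reported, and a
    self-loop (a, a) adds 2 to the degree of a.
    """
    # Pass 1: node -> degree (emit each endpoint, count per node).
    degrees: Counter = Counter()
    for a, b in edges:
        degrees[a] += 1
        degrees[b] += 1

    # Pass 2: degree -> number of nodes with that degree.
    counts = Counter(degrees.values())

    total_nodes = len(degrees)
    return {k: count / total_nodes for k, count in sorted(counts.items())}
```

For a star with one hub and three leaves, `degree_distribution([("a", "b"), ("a", "c"), ("a", "d")])` gives `{1: 0.75, 3: 0.25}`: three of the four nodes have degree 1 and one has degree 3.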
MongoDB vs MySQL: A DevOps point of view
Pierre Bailet and Mathieu Poumeyrol of fotopedia (a French photo site) share their experience of operating a small MongoDB cluster since September 2009, compared with a MySQL cluster.
Some details about fotopedia:
- fotopedia is 100% on AWS
- Amazon RDS for MySQL
- 4 nodes MongoDB cluster
- 150 million photo views
MongoDB advantages:
- no alter table
- background index creation
- data backup & restoration
- note: as far as I can tell MySQL is able to do the same
- replica sets
- hardware migration
- note: the same procedure can be used for MySQL
Before leaving you with the slides, here is an interesting accepted trade-off:
Quietly losing seconds of writes is preferable to:
- weekly minutes-long maintenance periods
- minutes-long unscheduled downtime and manual failover in case of hardware failures
Whirr and Hadoop Quickstart Guide: Automating a Rackspace Hadoop Cluster
Even if most of the examples show Whirr in action on the Amazon cloud, Whirr is cloud-neutral. Bob Gourley uses Whirr to fire up a CDH1 cluster on Rackspace.
1. Cloudera Distribution of Hadoop.
Speaking at MySQL Meetup in Charlotte,NC
Limit your abstractions: Application Events–what about change?
In my previous post, I showed an example of application events and asked what is wrong with them.
public class CargoInspectionServiceImpl : ICargoInspectionService
{
// code redacted for simplicity
public override void InspectCargo(TrackingId trackingId)
{
Validate.NotNull(trackingId, "Tracking ID is required");
Cargo cargo = cargoRepository.Find(trackingId);
if (cargo == null)
{
logger.Warn("Can't inspect non-existing cargo " + trackingId);
return;
}
HandlingHistory handlingHistory = handlingEventRepository.LookupHandlingHistoryOfCargo(trackingId);
cargo.DeriveDeliveryProgress(handlingHistory);
if (cargo.Delivery.Misdirected)
{
applicationEvents.CargoWasMisdirected(cargo);
}
if (cargo.Delivery.UnloadedAtDestination)
{
applicationEvents.CargoHasArrived(cargo);
}
cargoRepository.Store(cargo);
}
}
This is very problematic code, from my point of view, for several reasons. Look at how it allocates responsibilities. IApplicationEvents is supposed to execute the actual event, but deciding when to execute the event is left to the caller. I said several reasons, but this is the main one; all the others flow from it.
What happens when the rules for invoking an event change? What happens when we want to add a new event?
The way this is handled is broken. It violates the Open Closed Principle, it violates the Single Responsibility Principle and it frankly annoys me.
Can you think about ways to improve this?
I’ll discuss some in my next post.
Using Twitter Storm to analyze the Twitter Stream
Francisco Jordano briefly introduces the concepts and provides some good resources for learning about Twitter Storm, and then presents his experiment of using Twitter Storm to analyze the Twitter stream (nb: the project is on GitHub):

That’s how the information will flow, and the kind of tasks that we will execute. Yes it’s more effective to group some of those tasks, but remember, we just wanted to give this a try ;P
Worth emphasizing is that nowhere in the post is there any mention of him having trouble finding documentation or getting Twitter Storm up and running.
Let's talk about EXPLAIN and how it uses the QUERYNO column
Procedures / Procedimentos Owner vs Restricted
This article is written in English and Portuguese
English version:
Introduction
This article focuses on a little-known aspect of stored procedures or functions. That probably explains why it was the least voted topic in a recent poll I conducted. Nonetheless it's (from my point of view) a very interesting topic. Throughout this article I'll be referring to procedures, but I could just as well use the term functions.
If we take a look at the sysprocedures table we'll see a field called mode. This field is just one character and the values it can contain are:
- D or d: DBA
- O or o: Owner
- P or p: Protected
- R or r: Restricted
- T or t: Trigger
If, as user informix, you run:
CREATE PROCEDURE test()
END PROCEDURE
You'll have an OWNER mode procedure, owned by the informix user. But if instead you run:
CREATE PROCEDURE myuser.test()
END PROCEDURE
You'll have a RESTRICTED mode procedure owned by myuser.
You need to have DBA privilege to create a procedure on behalf of another user.
Why RESTRICTED?
The reasons why the restricted mode procedures/functions were created are based on security. Let's imagine the following scenario:
- You have two databases called db1 and db2
- You have a user myuser with connect privileges on db1 and db2 and another user mydba with DBA privileges on db1
- User myuser needs to be connected to db1 and run a distributed query to db2
- The db2's DBA grants the required privileges on db2 to user myuser
This is why the RESTRICTED mode was created. Every time we create a procedure on behalf of another user, it will be created as a RESTRICTED mode procedure. As such, any remote operation will be done using the identity of the currently logged-in user and not the identity of the procedure owner (as happens with OWNER mode procedures).
Other implications
So, the reasons for the creation of this new mode are explained, and they are good reasons. But there can be another implication. Note that I'll be referencing a product issue that you would most probably never notice. The fix for that bug, however, introduced new limits and a new error, so it is worth digging a bit deeper into this.
Whenever we make a remote connection inside a statement we need to open a new database. And we need to keep a record of the currently opened ones. The structure holding the opened databases used to be an array of "only" 8 positions. In certain conditions we could wrap around it without raising an error. This could lead to a nasty situation where the "current" database was not the one it should be. I noticed this in a customer environment when we started to get error -674 (procedure not found) on a procedure called from a trigger. Why is this related to restricted vs owner mode procedures? Because the mixed use of restricted and owner mode procedures raises the possibility of having the same database opened with different users (the owners and our current user).
Please don't be scared by this problem. The situation I encountered involved around 60 objects (tables and procedures) linked together by a complex sequence of triggers that called procedures, which made INSERTs/UPDATEs/DELETEs, which in turn called other procedures, and so on.
This sequence was started by a simple INSERT. And it involved 5 databases. The array I mentioned earlier had 8 positions.
Since then, we fixed several things and now (11.50.xC9 and 11.70.xC3):
- The array was increased to 32 positions
- If we still reach that limit, a proper error will be raised (-26600)
- The documentation was improved (it didn't mention any limit; it currently still mentions 8, but that should be fixed soon)
Introduction
This article focuses on a little-known aspect of stored procedures (or functions). The fact that it is little known probably helps explain why it was the least voted topic in a poll I ran recently. Even so, it is an interesting subject (from my point of view). Throughout this article I will mostly refer to "procedures", but you can read that as "functions" too.
If we take a look at the sysprocedures table, we can see that it contains a column named mode. It is a single character, and the values it can hold are:
- D or d: DBA
- O or o: Owner
- P or p: Protected
- R or r: Restricted
- T or t: Trigger
CREATE PROCEDURE teste()
END PROCEDURE
We end up with an OWNER mode procedure whose owner is informix. But if we instead run:
CREATE PROCEDURE myuser.teste()
END PROCEDURE
We end up with a RESTRICTED mode procedure whose owner is myuser.
You need DBA privileges to create procedures on behalf of another user.
Why RESTRICTED?
The reasons that led to the creation of RESTRICTED mode for functions and procedures have to do with security. Let's imagine the following scenario:
- We have two databases called bd1 and bd2
- We have a user myuser with CONNECT privileges on bd1 and bd2, and another user mydba with DBA privileges on bd1
- User myuser needs to run a distributed query against bd2 while connected to bd1
- The DBA of bd2 grants the required privileges on bd2 to user myuser
This was the reason that led to the creation of this new mode. In practical terms, a procedure created as RESTRICTED executes all remote operations with the identity of the user who is running it, and not with the identity of the user defined as its owner (who may be different from whoever created it).
Other implications
So, the reasons for introducing this new mode have been presented, and they are good reasons. But there can be other implications. Next I will refer to a product bug, although it is highly unlikely that you will ever run into it. The fix, however, introduced some changes that are worth noting and worth spending some time on.
Every time we make a remote connection inside an SQL statement, we have to open the remote database. And we need to keep a record of the databases that are open at each moment. The structure that holds this information used to be an array of "only" 8 positions. In certain situations we could "wrap around" it without raising a proper error. This could lead to a situation where the "current" database was not the one it should be (because of the way connections were opened and closed during the execution of an SQL statement). I ran into this in a customer environment where we started getting error -674 (procedure not found) on a procedure fired by a trigger. How does this relate to the subject of this article? Because the mixed use of RESTRICTED and OWNER mode procedures leads to a larger number of databases being open at the same time (each connection has a specific user associated with it which, depending on the mode, can be the owner of the procedures or the session user).
Don't be scared by this problem. To put it in context, the situation I found involved around 60 objects (tables and procedures) linked by a complex web of triggers and procedures (triggers that called procedures that ran INSERTs, UPDATEs and DELETEs, which in turn fired other triggers, and so on).
The sequence was started by a simple INSERT and involved 5 distinct databases. The array mentioned earlier had only 8 positions.
This led to several fixes, and now (11.50.xC9 and 11.70.xC3):
- The array was increased to 32 positions
- If we ever reach this limit (I sincerely hope not), a proper error will be returned (-26600)
- The documentation was improved (it did not mention any limit, and at the moment it still refers to 8... It should be fixed soon)
Research in the MapReduce Space
Over the weekend I’ve read two papers presenting products or research related to improving or adding new capabilities to the MapReduce data processing approach. The first of them comes from a team at Microsoft and describes TiMR, a time-oriented data processing system in MapReduce. The second, from a team at Google, presents Tenzing, a SQL implementation on the MapReduce framework. It’s great to learn that while the Hadoop community is eliminating some of the initial limitations and hardening the technical details of the platform, there are already ideas and systems out there that augment the capabilities of the MapReduce data processing model.
Paper: Tenzing A SQL Implementation on the MapReduce Framework
This recent paper from a team at Google presents details about Tenzing, a system that is currently in use at Google:
Tenzing is a query engine built on top of MapReduce for ad hoc analysis of Google data. Tenzing supports a mostly complete SQL implementation (with several extensions) combined with several key characteristics such as heterogeneity, high performance, scalability, reliability, metadata awareness, low latency, support for columnar storage and structured data, and easy extensibility.
A couple of things I’ve highlighted when reading it:
- Tenzing is in production, but doesn’t yet serve a huge number of queries
- the backend storage can be a mix of various data stores, such as ColumnIO, Bigtable, GFS files, MySQL databases
- when compared with other similar solutions (Sawzall, Flume-Java, Pig, Hive, HadoopDB), Tenzing’s advantage is low latency
- the paper acknowledges AsterData, GreenPlum, Paraccel, Vertica for using a MapReduce execution model in their engines
- to perform query optimizations, Tenzing is enhancing queries with information from a metadata server
- there is no information about what kind of metadata is needed in Tenzing. I assume it might refer to details about the data sources and data source metadata (indexes, access patterns, etc)
- to reduce query latency, processes are kept running
- Tenzing supports almost all of the SQL92 standard and some extensions from SQL99
- projection and filtering (for some of these and depending on the data source Tenzing can do some optimizations)
- set operations (implemented in the reduce phase)
- nested queries and subqueries
- aggregation and statistical functions
- analytic functions (syntax similar to PostgreSQL/Oracle)
- OLAP extensions
- JOINs:
Tenzing supports efficient joins across data sources, such as ColumnIO to Bigtable; inner, left, right, cross, and full outer joins; and equi semi-equi, non-equi and function based joins. Cross joins are only supported for tables small enough to fit in memory, and right outer joins are supported only with sort/merge joins. Non-equi correlated subqueries are currently not supported. We include distributed implementations for nested loop, sort/merge and hash joins.
Read and download the “Tenzing A SQL Implementation on the MapReduce framework” after the break.