Rob, it’s nice to be here with you today. I have a couple of questions regarding The Data Access Handbook. In particular, in The Data Access Handbook, you write about the performance penalties associated with data encryption. Do you have any tips for improving throughput? We’re just talking about the data encryption topic.
Rob Steward:
Great question, Mike. In our lab we benchmark strong encryption against no encryption for data access, meaning all the data that you’re requesting from the database that flows back across the network. When we talk about data encryption, most people mean SSL – some sort of algorithm through SSL to encrypt the data. What we’ve found in the past in our benchmarking is that the encryption itself causes about 100% overhead. In other words, if I were to fetch a result set and it took 10 seconds, then with strong SSL encryption turned on that entire result set would probably take 20 seconds instead of 10. So it about doubles the amount of time involved in communicating with the database. It’s a very good question, particularly in the regulatory environment today, where a lot of government regulations are forcing people to encrypt a lot of data that they didn’t in the past.
What I can tell you is that overhead – that overhead of about 100% that goes along with the data encryption – I have not seen any way to significantly reduce it. You’re kind of stuck. You’re going to take a penalty, and I think everybody knows that. But that penalty, from my experience, is about 100%. Now, what that does do is make it even more important that you follow the tips that we give you in The Data Access Handbook in order to reduce the amount of data and reduce the number of round trips across the network. So, if I can take that result set and reduce it in size, the amount of time it takes to transmit that data to the client is going to be significantly reduced, obviously – even without encryption. But again, if you’re going to take your existing application, or you’re going to take some piece of code that you’ve got that is accessing data, it’s even more important that you make sure that it’s tuned to reduce the amount of data.
There are a lot of tips and tricks in the book that talk about how to reduce the amount of data that comes across the network, and I’ll refer you to the book. But particularly with encryption on, you really need to reduce the size of that data.
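The arithmetic behind that advice can be sketched in a few lines of Java. This is only an illustrative model, not a benchmark: the 100% figure is the estimate quoted above, while the class name, the throughput, and the result-set sizes are made-up assumptions chosen to reproduce the 10-second versus 20-second example.

```java
// Illustrative model only. The 100% overhead is the estimate quoted in the
// interview; the throughput and result-set sizes are hypothetical numbers.
final class EncryptionOverheadModel {
    // Fractional cost of encryption: 1.0 means "about 100%", i.e. doubled time.
    static final double ENCRYPTION_OVERHEAD = 1.0;

    /** Estimated seconds to fetch a result set of the given size. */
    static double fetchSeconds(long resultBytes, double bytesPerSecond, boolean encrypted) {
        double base = resultBytes / bytesPerSecond;
        return encrypted ? base * (1.0 + ENCRYPTION_OVERHEAD) : base;
    }

    public static void main(String[] args) {
        double throughput = 10_000_000.0;     // hypothetical: 10 MB/s without encryption
        long fullResult = 100_000_000L;       // hypothetical: 100 MB result set
        long trimmedResult = fullResult / 2;  // same query after dropping unneeded data

        System.out.println(fetchSeconds(fullResult, throughput, false));    // unencrypted
        System.out.println(fetchSeconds(fullResult, throughput, true));     // encrypted
        System.out.println(fetchSeconds(trimmedResult, throughput, true));  // encrypted, half the data
    }
}
```

Under these assumptions, halving the result set brings the encrypted fetch back to roughly the time of the original unencrypted fetch, which is why trimming the data matters more once encryption is on.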
This podcast is the second of a two-part series. In it, Rob Steward discusses what to look for in a highly effective database JDBC driver. The podcast runs for 6:42.
Now in JDBC terms, the JDBC specification formalized this concept of the architecture of the driver itself. They called it Type 1, Type 2, Type 3, and Type 4. So a Type 1 driver – of which there was only ever one – was a bridge from JDBC to ODBC. Type 2 is what I just described, where the Java piece of that driver sits on top of some native Windows or Solaris or Linux client piece that talks to the database. Type 3 was pure Java, and it talked to some intermediate server. So it may be just a JDBC driver that’s pure Java, but it had some server component that it would talk to that would then in turn talk to the Oracle or the DB2 database, or whatever database. And then Type 4 – which is most common – is pure Java, opens up that TCP/IP socket to the database and talks directly to the database. In ODBC terms we call that wire protocol. In JDBC it’s a Type 4. So that architecture makes a big difference. In the ADO.NET world, we call that 100% managed: something that is completely running within the CLR and that opens up that socket to the database server.
Now this is a huge deal. The reason that I’m spending so much time in answering this question on that one particular subject is that the architecture matters, and not only for, say, the versioning conflict that I talked about. If you’ve got a Type 2 in JDBC or anything that’s not completely managed in .NET, then you’re giving up one of the biggest benefits of those environments: the platform independence; the ability to, within your single process, have all the assemblies or the components that you need for that application. If you have some dependence on the native operating system, then you’re giving up those big advantages. So you run into those versioning issues that I talk about, or you run into conflicts among shared objects. If you have a Type 4 JDBC or 100% pure managed .NET, you don’t have those issues. With ODBC you can eliminate a bunch of these conflicts because you just eliminate a number of components that you need.
In addition to this versioning and compatibility issue, it actually makes a really big difference in terms of performance and scalability. So if you think about it, in computer science we’re always taught in school to simplify things. The simpler the algorithm the better. It’s not just more elegant; it’s actually better performance.
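The four architectures just described can be summarized in a small Java sketch. The enum, its fields, and the `isWireProtocol` helper are illustrative names invented here; they are not part of the JDBC API, and the classification simply restates the descriptions above.

```java
// Summary of the four JDBC driver architectures as described in the interview.
// The enum, field, and method names are illustrative, not part of the JDBC API.
enum JdbcDriverType {
    TYPE_1(false, false, "Bridge from JDBC to ODBC"),
    TYPE_2(false, false, "Java layer on top of a native database client"),
    TYPE_3(true, true, "Pure Java, talks to an intermediate server"),
    TYPE_4(true, false, "Pure Java, opens a socket directly to the database");

    final boolean pureJava;              // no native client or ODBC layer underneath
    final boolean needsMiddleTierServer; // requires a separate server component
    final String description;

    JdbcDriverType(boolean pureJava, boolean needsMiddleTierServer, String description) {
        this.pureJava = pureJava;
        this.needsMiddleTierServer = needsMiddleTierServer;
        this.description = description;
    }

    /** True for the architecture the interview calls "wire protocol" in ODBC terms. */
    boolean isWireProtocol() {
        return pureJava && !needsMiddleTierServer;
    }
}
```

Only `TYPE_4` satisfies `isWireProtocol()`, which matches the point that a Type 4 driver is the JDBC counterpart of a wire protocol ODBC driver or a 100% managed ADO.NET provider.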
The first class I took in college where I was dealing with data structures and sorting, the professor walked in and said, ‘okay, write a bubble sort algorithm.’ So we wrote a bubble sort, and we turned it in, and as soon as we turned it in the professor said, ‘now that you’ve done that, never do that again.’ Now why did he do that? He did that because the algorithm is somewhat complex, but the reason that we were never supposed to write it again is because it was inefficient. We can write a much better binary search or something like that, which is actually much simpler but also performs significantly better than that bubble sort. This is the kind of thing we’re taught in computer science, and that’s the reason that we’re taught it: scalability and performance.
So if you have fewer layers and less complex interactions, what you end up with is better scalability and performance. For example, specifically, you may retrieve some data from the database and it may be buffered in that client layer. Well then it’s got to make a copy to hand up to that driver layer above it. So we may end up using twice the amount of memory that we need, as opposed to if that driver was stand-alone and didn’t have that other layer. Also, if you get a driver that’s wire protocol ODBC, Type 4 JDBC, or 100% managed ADO.NET, that driver is built specifically to handle the API that you’re writing to. So if you’ve written an ODBC application, then that driver has the capabilities and the code written into it to handle ODBC. It doesn’t need to handle other things that are not ODBC. So if you have that other layer under there – which is the database client piece – which is built to handle more than ODBC or JDBC or ADO.NET underneath it, then there is complexity and code in it that you don’t need. This causes it to not perform or scale as well.
In a nutshell I would say that you want to look for the architecture of the driver; that really matters. You also want to look for experience. A company that writes a single ODBC driver or a single JDBC driver is not going to do as well at writing those drivers as a company that writes 5 or 10 or 20 of them. Why is that important? Well when you write a bunch of different drivers, you understand what ODBC or JDBC or ADO.NET applications need. You understand how they interact with the drivers better because you have a much broader area of experience. And you’re able to optimize those things within those drivers. So I would say the broad experience of the company that writes those drivers, as well as that architecture.
Another thing that I would look at is the vendor who writes the driver. Is the driver a profit source for them? If you have a vendor who, say, gives the driver away for free, then they don’t have the incentive to write as good a driver. It’s kind of the ‘you get what you pay for’ kind of a thing; absolutely true with drivers as well. And, as we just wrote a book on the subject: What kind of difference can those drivers make? Absolutely huge.
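The memory point about the extra buffer copy can be sketched as follows. This is a toy model under stated assumptions: `clientLayerRead` stands in for a hypothetical native client layer, and real drivers manage buffers in more elaborate ways than a single array copy.

```java
import java.util.Arrays;

// Toy model of the buffering described above. All names are hypothetical;
// real drivers and client libraries manage memory in more elaborate ways.
final class LayeredBufferSketch {
    /** Stand-in for a native client layer that buffers the bytes it read. */
    static byte[] clientLayerRead(int resultSize) {
        return new byte[resultSize];
    }

    /** Layered driver: the client buffers the data, then the driver copies it. */
    static long layeredBytesHeld(int resultSize) {
        byte[] clientBuffer = clientLayerRead(resultSize);
        byte[] driverBuffer = Arrays.copyOf(clientBuffer, clientBuffer.length);
        return (long) clientBuffer.length + driverBuffer.length;
    }

    /** Stand-alone driver: it reads from the socket into its own single buffer. */
    static long directBytesHeld(int resultSize) {
        byte[] driverBuffer = new byte[resultSize];
        return driverBuffer.length;
    }

    public static void main(String[] args) {
        int resultSize = 1_000_000; // hypothetical 1 MB of result data
        System.out.println("layered: " + layeredBytesHeld(resultSize) + " bytes");
        System.out.println("direct:  " + directBytesHeld(resultSize) + " bytes");
    }
}
```

In this simplified picture the layered design holds two copies of the same result data at once, which is the "twice the amount of memory" effect mentioned above.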
I would say that you want to look at the vendor; you want to look at what they make; you want to look at the architecture of those drivers. Just a few tips there on what I would look for in terms of a driver.
Rob, what are some guidelines for data access and service oriented architecture, and why are both data experts and SOA experts needed to ensure success?
Rob Steward:
Well I guess the primary answer to that is that what we’ve seen over the last four to five years – as people have started to implement SOA environments and roll them out into production – is that the people who are in charge of those environments and in charge of these projects, which are always very large projects, are typically your SOA experts. They understand what it means to take some bit of business logic, encapsulate it into a service, and then how do you expose that? How do you represent that to all the different application groups that may use that service? Those are the people who are typically in charge of these projects.
So what we’ve seen happen over and over is that these guys design the services, but when they design the services they’re not experts on data access. So they design them with service orientation in mind, but not necessarily with what is the best and fastest way to access the data within those services.
Most services out there, probably 75-90% of the services out there, actually access some kind of data. Well within that service you write the code to go get whatever the data is you need to process a return – and you’ve got the people who understand services writing that code, and not necessarily the people who know the best way to write that code to access the data. So what I’ve seen happen over and over is somebody will design a service, let’s say to return a customer record, and typically services get rolled out originally due to an application that has a need for that particular service. So the first application that’s written – that is service oriented, that needs a service to return a customer – that group will typically write that service, with the help of the SOA architects. So they write this thing, they have an application that maybe 50 users use. The service goes out, the application’s using that service – it’s working fine for those 50 users – then, of course, under SOA guidelines another application comes along and needs a service that returns a customer.
Well under SOA the idea is to reuse that same service. So the second application hooks up and starts to use that service that returns a customer. Now instead of having 50 users of that service we suddenly have 150 users of that service. And then a third application rolls out, and a fourth application rolls out, and a service that was originally working very well for those original 50 users all of a sudden has 500 or 1,000 or 10,000 users on it, and it’s not scaling well.
And over and over we’ve run into this, and as we start to look at these services, you realize: The data access code within that service actually is the bottleneck. The way that code is written, the way that data is accessed, causes that service not to scale as users are added to it.
So again, though the original service that rolled out for that original application might have been performing fine, as we start to scale up and add more and more applications, what we find out is that it wasn’t written optimally to scale up. And then we have to go in and fix those services and fix that data access code.
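A rough back-of-the-envelope model shows how data access code that is fine for 50 users can become the bottleneck later. Everything here is an assumption made for illustration: the pool size, the round-trip counts and timings, the one-call-per-second user load, and the simplification that a service call holds a database connection only for the duration of its round trips.

```java
// Crude capacity model, for illustration only. The pool size, round-trip
// counts, timings, and user load are hypothetical, and real services have
// many other limits besides database round trips.
final class ServiceCapacityModel {
    /** Maximum service calls per second that a connection pool can sustain. */
    static double maxCallsPerSecond(int poolSize, int roundTripsPerCall, double roundTripMillis) {
        double millisPerCall = roundTripsPerCall * roundTripMillis;
        return poolSize * 1000.0 / millisPerCall;
    }

    /** Whether the given user population fits within that capacity. */
    static boolean canServe(int users, double callsPerUserPerSecond, double maxCallsPerSecond) {
        return users * callsPerUserPerSecond <= maxCallsPerSecond;
    }

    public static void main(String[] args) {
        // Original code: 20 round trips of 5 ms each per service call, 10 connections.
        double original = maxCallsPerSecond(10, 20, 5.0);
        // Tuned code: the same work done in 2 round trips.
        double tuned = maxCallsPerSecond(10, 2, 5.0);

        System.out.println("50 users, original code:   " + canServe(50, 1.0, original));
        System.out.println("1000 users, original code: " + canServe(1000, 1.0, original));
        System.out.println("1000 users, tuned code:    " + canServe(1000, 1.0, tuned));
    }
}
```

With these made-up numbers the original code comfortably serves the first 50 users but not 1,000, while cutting the round trips per call restores headroom without touching the service interface.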
In places where I’ve seen SOA work and work very well – and I’m a data access expert so I concentrate on the data access code within those services – is when you get the data architect together with the SOA architect to jointly design and implement those services. Because the data guys understand what it takes to build those services in a way that’s going to scale. And as the enterprise moves more and more into SOA, they’re going to have more and more users on those services. And where I’ve seen this be successful is where those services that access data were well designed up front, with the understanding that it’s not just going to be the 50 users; eventually we’re going to have 1,000 or 5,000 or 10,000 users on there.
What do you see in the future in database connectivity?
Rob Steward:
Well I think the future is still standards based. We talk specifically in the book about ODBC, JDBC, and ADO.NET. These are standard data access APIs, which in fact most of the world uses to access their data. Now you may use something like a Hibernate or an NHibernate on top of JDBC or ADO.NET, but in a way those are really standard ways to access your data as well. And they’re still sitting on top of those JDBC drivers or those ADO.NET data providers. So I think that the future is still standards based because of all the benefits of using a standard: you learn one API instead of learning one API for every database you want to access. But I also think there are some influences on the industry now that will change some of those standard ways to access data.
The biggest thing I see coming is cloud computing. In the cloud it’s still data. Let’s say you use SalesForce.com as your CRM; you still need to take that data and integrate it with your applications and with your data that lives within your firewall. So, I think one of the big influences on the future of connectivity is, how do we get access to those cloud sources in the standard way? If you build an application on the Google App Engine, or you build an application on Force.com, or you’re using some SaaS type application like the Salesforce CRM, how do we get to that data and integrate it with everything else in your enterprise? So I think that what we’ll see is the connectivity start to branch out and begin to access those types of sources in those standard ways so that you can plug it into all those applications that you know and love today. You may have a business intelligence application that can handle any ODBC connection; well, you still need that ODBC connectivity to those cloud sources, or a JDBC application, or your .NET applications. How do we get that data from those sources instead of the traditional relational sources into your application, into your enterprise? I think that’s one of the big things.
Now cloud computing and accessing data across the internet or cloud type interfaces has its own unique challenges that will need to be addressed by the connectivity vendors. Personally I’ve been spending a lot of time in that space lately, looking at what the unique challenges are. When you access something across the internet, there are some interesting performance implications when you think about the latency that’s going to be there that doesn’t really exist when you’re inside your firewall. How do you reduce the number of round trips to get that data? And how do you reduce the number of web service calls that you make to get that data out of the cloud? I think that’s going to have a big influence on the direction of connectivity for the future. Those are the problems that have to be solved to make these types of architecture, i.e. using cloud sources, perform as well as what people are used to with absolute control inside the firewall. So that’s what I think is probably the biggest thing that I see that’s going to influence connectivity moving forward.
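The effect of latency on round trips can be made concrete with a small sketch. The row counts, batch size, and per-call latency are hypothetical, and the model assumes calls are made one after another and ignores transfer and processing time.

```java
// Illustrative latency model for fetching rows from a cloud source.
// Row counts, batch sizes, and latency are hypothetical; calls are assumed
// to be sequential, and transfer and processing time are ignored.
final class CloudRoundTripModel {
    /** Number of web service calls needed to fetch all rows. */
    static long callsNeeded(long rows, int rowsPerCall) {
        return (rows + rowsPerCall - 1) / rowsPerCall; // ceiling division
    }

    /** Total time spent waiting on network latency, in milliseconds. */
    static long latencyMillis(long rows, int rowsPerCall, long latencyPerCallMillis) {
        return callsNeeded(rows, rowsPerCall) * latencyPerCallMillis;
    }

    public static void main(String[] args) {
        long rows = 10_000;  // hypothetical result size
        long latency = 100;  // hypothetical 100 ms per call across the internet

        System.out.println("1 row per call:    " + latencyMillis(rows, 1, latency) + " ms");
        System.out.println("200 rows per call: " + latencyMillis(rows, 200, latency) + " ms");
    }
}
```

Under these assumptions, fetching one row per call spends 1,000,000 ms waiting on the network, while fetching 200 rows per call spends 5,000 ms, which is why reducing the number of calls matters so much once the data lives outside the firewall.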