The Data Access Handbook covers numerous causes for database bottlenecks, including hardware constraints, particularly with regards to memory issues. In this podcast Rob addresses issues that cause memory bottlenecks and explains how they can be overcome. The podcast lasts for 4:32. Read the rest of this entry »
Rob, it’s nice to be here with you today. I have a couple of questions regarding The Data Access Handbook. In particularly, in The Data Access Handbook, you write about the performance penalties associated with data encryption. Do you have any tips for improving throughput? We’re just talking about the data encryption topic.
Rob Steward:
Great question, Mike. What we’ve found in our lab when we do benchmarking of using strong encryption versus no encryption when it comes to data access (all that data that you’re requesting from the database that flows back across. Particularly when we talk about data encryption most people mean SSL – some sort algorithm through SSL to encrypt the data), what we’ve found in the past in our benchmarking is that the encryption itself causes about 100% overhead. In other words, if I were to fetch a result set and it were to take ten seconds, if I were to turn on and strongly encrypt through SSL that entire result set would probably take 20 seconds instead of 10 seconds. So it about doubles the amount of time involved in communicating with the database. So it’s a very good question, particularly in the regulatory environment today where a lot of government regulations are forcing people to encrypt a lot of data that they weren’t in the past.
What I can tell you is that overhead – that overhead about 100% that goes along with the data encryption – I have not seen any way to significantly reduce that. You’re kind of stuck. You’re going to take a penalty, and I think everybody knows that. But that penalty, from my experience, is about 100%. Now, what that does do is make it even more important that you follow the tips that we give you in The Data Access Handbook in order to reduce the amount of data and reduce the amount of roundtrips across the network. So, if I can take that result set and reduce it in size, the amount of time that it’s going to transmit that data to the client is going to be significantly reduced, obviously – even without encryption. But again, if you’re going to take your existing application, or you’re going to take some piece of code that you’ve got that is accessing data, it’s even more important that you make sure that it’s tuned to reduce the amount of data.
There are a lot of tips and tricks in the book that talk about how to reduce the amount of data that comes across the network, and I’ll refer you to the book. But particularly with encryption on, you really need to reduce the size of that data.
In this podcast Rob Steward explains the first steps to take when confronted with a database application performance issue. The podcast runs for 4:00. Read the rest of this entry »
In the first half of this two part podcast, Rob Steward explains the impact reducing network traffic has on database application performance. The podcast runs for 4:57. Read the rest of this entry »
The Data Access Handbook explains how virtualization is significant for database computing. In this podcast Rob Steward explains the implications virtualization has on the performance of database applications. The podcast runs for 6:30. Read the rest of this entry »
In this two-part podcast Rob Steward explains the difference between good and bad coding, and how good coding can benefit your ODBC, JDBC, or .NET applications.
Part 2, which runs 7:05, provides real-world examples of the impact that good coding can make
I’ll give you one other example that I like to use when I’m talking on performance, because it’s a pretty dramatic example. So I’ve been giving talks at conferences and seminars over the last 10 to 12 years now. And one of the tips that I’ve given out for a number of years now is controlling how you do transactions. So most people don’t realize that within all of those standards based APIs that I’ve talked about that you’re automatically in auto commit mode by default. What that means is, when I execute a statement, no matter how I execute it, it’s automatically committed. Now most databases don’t actually have an auto commit mode, but those standards based APIs require that by default, that’s the way you are. So I execute my insert into table, and it’s automatically committed.
Now, as apposed to manual commit mode, which is probably what most people think about when they think about transactions, where you begin a transaction, you do an insert or maybe multiple inserts or updates, and then actually issue the commit. All of these standards based APIs in order to make it easier on the programmers automatically commit everything for you. This is a good thing in most cases, but in some cases, that’s not what you want to happen. Obviously if you want to tie multiple inserts and updates within a single unit of work or within a single transaction, then you want to manually control it. You want to say start the transaction, issue the statement, and then issue the commit. And I think most people understand that.
But the thing that I’ve seen over and over is that you may want to do thousands of inserts or thousands of updates, but most people don’t understand the cost of auto commit mode is. So if we think about it, and think it through, what happens in the driver because of auto commit? If you’re in a database that doesn’t support an auto commit mode, which is most of them, most of them do not have a little switch that says ‘just automatically commit.’ Most of them require that what the driver does then in order to successfully do an auto commit is you execute a statement, what really happens underneath in the driver is it sends the statement across and it also sends a commit. So what that means is that for every statement that you’re executing, there’s actually two that are going across the wire. There’s your statement, and then there is a commit that goes behind it, even though you don’t see that. It’s required that the driver do that underneath you.
One of the things that a good driver or a good application will try to do is avoid network I/O. So now if we say we’re in auto commit and for every insert that we do we incur two network roundtrips. But not only that, but one of the most expensive things that we do on a database is commit. So what happens on a database when we do a commit, is its got all of your changes in memory and when you do the commit, when it responds with a success what it’s doing is guaranteeing that those changes will actually be persisted regardless of what happens next. So once you’ve got a success back from a commit or a rollback, the database is guaranteeing that no matter what happens, even if the power goes out or the machine goes down, that change is persistent. So what it has to do is, when you do the commit, is it has to flush some things out the disk. Not only does it flush them out the disk, but it tends to be non-sequential disk I/O.
So we all know that reading and writing from disk or to disk is one of the most expensive things that you can do, so we try to avoid that. When you do a commit you’re going to write to the disk. But not only that you’re going to write, typically because you’re updating the indexes and other things involved in a table, you’re going to write to very different places on the disk. So it’s non-sequential disk I/O.
The tip that we talk about in the book, and the thing that I wanted to point out on this podcast, is that if you don’t need to be in auto commit mode, then you can save significant amounts of time by switching into manual commit mode. And we go into the book exactly the code you write to do that. It’s also covered in all of the specifications for all of those APIs. But essentially, instead of going insert, insert, insert – you know, I’ve issued my four or five or 1,000 inserts or whatever it is – what I do is start transaction, move into a manual transaction control, issue all those inserts, and then do a commit.
Now I’ve given out this tip out for a number of years now. A couple of years ago I had this guy come up to me after a seminar, and he said, “Rob, I heard you say this tip a year ago, and I was doing this operation that actually involved doing this operation which actually involved doing a 5 million row insert into a DB2 table.” And he said, “It would take us 8 hours to do the insert. I went back after hearing you give this one tip on switching out of auto commit mode, changed my code, took me five minutes to do, I reran it, and it took 10 minutes.”
Now that’s an extremely good example, but most people aren’t doing 5 million row inserts all the time. But it does show you the kind of difference that one little coding change can make. They were doing the 5 million row inserts, they went from taking 8 hours to taking 10 minutes just by implementing this one tip that we cover in the book. That’s the kind of difference that good code and bad code can make. Both of those things work, but 8 hours verses 10 minutes is definitely the success or failure of that product or that project.
Those are the kinds of things that we talk about. Again, I would encourage you to read the book, because we go into a lot of these things and explain why they’re true and give you very specific code examples as to what kind of things are good verses bad.
In this two-part podcast Rob Steward explains the difference between good and bad coding, and how good coding can benefit your ODBC, JDBC, or .NET applications.
Part 1, which runs 7:05, focuses on coding best practices, while Part 2 provides real-world examples of the impact that good coding can make.
Great question, and there are large segments of the book devoted to answering that questions. Specifically, there are three chapters of the book, one for ODBC, one for JDBC, and one for ADO.NET that are devoted solely to good coding techniques. There are a lot of coding samples in there, there’s a lot of explanation of why one thing is good verses bad. I’ll give you a couple of examples of what we talk about in the book.
For example, with any of those APIs that I just mentioned, ADO.NET, ODBC or JDBC, there are multiple ways to execute a statement. For example, in ODBC, you could call a function called SQLExecDirect, or you can call SQLPrepare followed by SQL Execute. Now most people do not understand when is the write choice to use one or the other of those methods. There are advantages, particularly performance and scalability advantages, to picking one verses the other. In the book we talk about this, when do we choose one or the other, but I’ll walk through that one example here.
Prepare and Execute actually is very different than Exec Direct. I can execute the same statement either way. It’s going to work either way, but depending on what I want to do one is going to work much better than the other. So Prepare and Execute, two APIs that I have to call rather Exec Direct.
I should use Prepare and Execute anytime I have a parameterized SQL statement that I intend to execute more than once, and here’s why: Because when I call Prepare, what I actually do, I incur an extra round trip to the database server. So it’s an extra command that I send up there, I do the Prepare, it sends the SQL statement. What it tells the database server to do is generate an execution plan. Now an execution plan on a database server is really – when it goes through your SQL statement and you may do something like ‘select * from table where last name equals Steward.’ What an execution plan would look like for a statement like that is sort of a structure that says, okay, the tables involved in this SQL statement are whatever table names (employ table names), and the columns in the where clause are last names. And I know that there is an index built on last names, so I store that information to say okay, when I evaluate and I look for where last name equals Steward then I’m going to use X index. And it loads this thing in the memory and it holds onto it. So that when I actually go to execute that statement, it doesn’t have to parse my SQL anymore; it’s already done that, and it already knows which indexes to use, and which table to load into the memory, and which segments those indexes may need or may not need, and whether it needs to make temporary tables, and all kinds of things like that are in the execution plan. That takes some time to generate.
When you do a Exec Direct what that does, and the driver sends that across the Exec Direct, then the database will execute your SQL statement, evaluate it, generate the result set for it, and then toss out that execution plan. When I do a Prepare and Execute, once I’ve done the execute it generates the results, but it doesn’t toss out the execution plan. In an example like I just gave you, a pretty simple SQL statement, select * from table where last name equals Steward, even in that case the generation of the execution plan is actually going to take significantly longer than generating the results set.
So if I generate it once, that same statement, I’m actually better off because I saved the database roundtrip by using Exec Direct, because I don’t need the database to save that execution plan. I’m not going to execute the statement again. Now if I want to do that execution over and over, then I probably want to use Prepare and Execute. Because I do that initial prepare, the database spends all time generating the execution plan, but when it’s done executing my statement, it holds onto it. And it knows that when it sees that statement again, ‘I don’t need to go back through that logic and figure out the indexes and the tables and all that other stuff, I already know what it is.”
Now, I said at the beginning parameterize SQL. So instead of where ‘last name equals Steward,’ I would have, ‘where last name equals ?’. And then I would bind those parameters. I would bind that parameter value to some buffer in my application. In the case of JDBC or ADO.NET it would just be some string object that is bound to that ? or to that parameter within the SQL statement. Instead of literal value, so instead of using a literal value, so instead of saying where last name equals ‘Steward,’ I want to use parameter markers. Prepared statements in databases within their execution plans understand that there is no difference as far as what tables are involved and indexes, and the parsing of the SQL, and the data types and things involved, there’s no difference regardless of what that value is. So it can be Steward, it can be Smith, it can be Jones, it doesn’t matter, the execution of the SQL statement is the same. So if I’m going to execute it over and over, I want to put parameter markers in there, and I want to use Prepare and Execute. That’s just one tip. There are many of these things, and many tips covered within the book where we say, ‘here’s a choice that you can make, this choice is better for these situations and this other choice is better for these other situations.’
This podcast is the second of a two part series. In it, Rob Steward discusses what to look for in a highly effective database JDBC driver. The podcast runs for 6:42
Now in JDBC terms, JDBC the specifications formalized this concept of the architecture of the driver itself. They called it Type 1, Type 2, Type 3, and Type 4. So a Type 1 driver – which there was only ever one – was a bridge from JDBC to ODBC. Type 2 is what I just described where the driver, the Java piece of that driver sits on top of some native Windows or Solaris or Linux client piece that talked to the database. Type 3 was pure Java, and it talked to some intermediate server. So it may be just a JDBC driver that’s pure Java, but it had some server component that it would talk to that it would then in tern talk to the Oracle or the DB2 database, or whatever database. And then Type 4 – which is most common – is pure Java, opens up that TCIP socket to the database and talks directly to the database. In ODBC terms we call that wire protocol. In JDBC it’s a Type 4. So that architecture makes a big difference. In the ADO.NET world, we call that 100% managed. Something that is completely running within that CLR that opens up that socket to the database server.
Now this is a huge deal. This architecture, and the reason that I’m spending so much time in answering this question on that one particular subject, is that architecture – not only does it matter for say the versioning conflict that I talked about – if you’ve got a Type 2 in JDBC or anything that’s not completely managed in .NET, then you’re giving up one of the biggest benefits of those environments: the platform independence; the ability to, within your single process, be able to have all the assemblies or the components that you need for that application. If you have some dependence on the native operating system, then you’re giving up those big advantages. So you run into those versioning issues that I talk about, or you run into conflicts among shared objects. If you have a Type 4 JDBC or 100% pure managed .NET, you don’t have those issues. With ODBC you can eliminate a bunch of these conflicts because just eliminate a number of components that you need.
In addition to this versioning and compatibility issue, it actually makes a really big difference in terms performance and scalability. So if you think about it, in computer science we’re always taught in school to simplify things. The simpler the algorithm the better. It’s not just more elegant; it’s actually better performance.
The first class I took in college where I was dealing with data structures and sorting, the professor walked in and said, ‘okay, write a bubble sort algorithm.’ So we wrote a bubble sort, and we turned it in, and as soon as we turned it the professor said, ‘now that you’ve done that, never do that again.’ Now why did he do that? He did that because the algorithm is somewhat complex, but the reason that we were never supposed to write it again is because it was inefficient. We can write a much better binary search or something like that, which is actually much simpler but also performs significantly better than that bubble sort. This is the kind of thing we’re taught in computer science, and that’s the reason that we’re taught it: scalability and performance.
So if you have less layers and less complex interactions, what you end up with is better scalability and performance. For example, specifically, you may retrieve some data from the database and it may be buffered in that client layer. Well then it’s got to make a copy to hand up to that driver layer above it. So we may end up using twice the amount or memory that we need as apposed to if that driver was stand-alone and doesn’t have that other layer. Also, if you get a driver that’s wire protocol ODBC, Type 4 JDBC, 100% managed ADO.NET, that driver is built specifically to handle the API that you’re writing to. So if you’ve written an ODBC application, then that driver has the capabilities and the code written into it to handle ODBC. It doesn’t need to handle other things that are not ODBC. So if you have that other layer under there – which is the database client piece – which is built to handle more than ODBC or JDBC or ADO.NET underneath it, then there are complexities and codes in it that you don’t need. This causes it to not perform or scale as well.
In a nutshell I would say that you want to look for the architecture of the driver that really matters. You also want to look for experience. A company that writes a single ODBC driver of a single JDBC driver is not going to do as well at writing those drivers as a company that writes 5 or 10 or 20 of them. Why is that important? Well when you write a bunch of different drivers, you understand what ODBC or JDBC or ADO.NET applications need. You understand how they interact with the drivers better because you have a much broader area of experience. And you’re able to optimize those things within those drivers. So I would say the broad experience of the company that writes those drivers, as well as that architecture.
Another thing that I would look at is of the vendor who writes the driver. Is the driver a profit source for them? If you have vendor who say, gives the driver away for free, then they don’t have the incentive to write as good of a driver. It’s kind of the ‘you get what you pay for’ kind of a thing; absolutely true with drivers as well. And, as we just wrote a book on the subject: What kind of difference can those drivers make? Absolutely huge.
I would say that you want to look at the vendor; you want to look at what they make; you want to look at the architecture of those drivers. Just a few tips there on what I would look for in terms of a driver.
In this two-part podcast Rob Steward discusses what to look for in a highly effective database driver. Part 1, which runs 3:49, focuses on ODBC, while Part 2 will concentrate on JDBC.
Well of course I’ve spent my career developing, producing database drivers of all sorts: ODBC, JDBC, LAB and ADO.NET. And I’ve spent my career looking at different implementation, and looking at the way these things are written and how they interact with the databases. And I can tell you unequivocally that they make a huge difference in terms of your overall performance, reliability, and scalability of your data access applications.
There are several things that you need to look at when you look at those drivers from a cursory glance – or from a quick look – at a driver, what separates one from the other? The first thing I’d say is the architecture of that driver. ODBC was the first real standards based API that came out for which companies built drivers. That driver, what it does eventually is translated from that common API into some underlying thing that talks to a database or a data source. Data drivers do more than just that translation, but I’ll concentrate on that.
What it translates to underneath, when ODBC was originally released – or if you look at say JDBC when it originally came out and there were really just the beginning of JDBC drivers as well – the way they were written was you had this driver piece that sat on top of some other client API from the database vendors themselves. So let’s say in the case of Oracle you may have an ODBC driver, which is this layer that does this translation – it handles the error mappings, and it handles characters and translations, and all kinds of other things that it does. But it sits on top of what’s called OCI from Oracle – the client interface that Oracle has – and then that actually talks across the network to the database server, to the Oracle server. Same thing is true with the DB2, or with SQL server, or Sybase or Informix or any of the other ones we’re familiar with.
One big difference there are with drivers is whether that driver can talk directly to the database across the network, or whether it needs to sit on top of some other layer. What we call that in ODBC terms, we’ll call that a wire protocol driver. So if a driver doesn’t need any of those underlying client pieces from the database vendor, and it can directly open up the TCP/IP Docket or same pipeor somethingto the database, then we call that a wire protocol driver. It is a standalone piece that you can put with your application that can talk directly to the database. It doesn’t need some instillation of some other piece, which can cause you all kinds of versioning issues. Particularly we see it with Oracle where people say, ‘well I need a particular version of the ODBC driver, but I also need some particular version of this client piece, but I have another application that sits on the same machine that needs a different version of that client piece or a different version of the ODBC driver. And what you end up with is a big mess where it’s difficult in a Windows environment to use all of those things on the same machine.
Rob, what are some guidelines for data access and service oriented architecture, and why are both data experts and SOA experts needed to ensure success?
Rob Steward:
Well I guess the primary answer to that is that what we’ve seen over the last four to five years – as people have started to implement SOA environments and roll them out into production – is that the people who are in charge of those environments and in charge of these projects, which are always very large projects, are typically your SOA experts. They understand what it means to take some bit of business logic, encapsulate it into a service, and then how do you expose that? How do you represent that to all the different application groups that may use that service? Those are the people who are typically in charge of these projects.
So what we’ve seen happen over and over is that these guys design the service, but when they design the services what they’re not experts on data access. So they design them with service orientation in mind, but not necessarily with what is the best and fastest way to access the data within those services.
Most services out there, probably 75-90% of the services out there, actually access some kind of data. Well within that service you write the codes to go get whatever the data is you need to process a return – and you’ve got the people who understand services writing that code, and not necessarily the people who know the best way to write that code to access the data. So what I’ve seen happen over and over is somebody will design a service, lets say to return a customer record, so they design this service to return a customer record, and typically services get rolled out originally due to an application that has a need for that particular service. So the first application that’s written – that is service oriented, that needs a service to return to customer – that group will typically write that services, with the help of the SOA architects. So they write this thing, they have an application that maybe 50 users use. The service goes out, the application’s using that service – it’s working fine for those 50 users – then, of course, under SOA guidelines another application comes along and needs a service that returns a customer.
Well under SOA the idea is to reuse that same service. So the second application hooks up and starts to use that returns a customer. Now instead of having 50 users of that service we suddenly have 150 users of that service. And then a third application rolls out, and a fourth applications rolls out, and all of a sudden a service that was originally working very well for those original 50 users all of a sudden have 500 or 1,000 or 10,000 users on it, and it’s not scaling well.
And over and over we’ve run into this, and as we start to look at these services, you realize: The data access code within that service actually is the bottleneck. The way that code is written, the way that data is accessed, causes that service not to scale as users are added to it.
So again, though the original service that rolled out for that original application might have been performing fine, as we start to scale up and add more and more applications, what we find out is that it wasn’t written optimally to scale up. And then we have to go in and fix those services and fix that data access code.
In places where I’ve seen SOA work and work very well – and I’m a data access expert so I concentrate on the data access code within those services – where I’ve seen it work really well is when you get in conjunction the data architect as well as the SOA architect to jointly design and implement those services. Because the data guys understand what it takes to build those services in a way that’s going to scale. And as the enterprise moves more and more into SOA, they’re going to have more and more users on those services. And where I’ve seen this be successful is where those service that access data were well designed up front, and understanding that it’s not just going to be the 50 users, eventually we’re going to have 1,000 or 5,000 or 10,000 users on there.
This book is for software architects, IT staff, DBAs, and developers to use in their daily work to predict, diagnose, and solve performance issues in their database applications.