
Designing well-formed URIs for your REST Web API

This is a follow-up post to “Using HTTP Status Codes correctly in your REST Web API” and “Using HTTP Verbs correctly in your REST Web API”. Continuing with the theme of RESTful Web APIs, I thought I would touch on the importance of the actual URL/URI part of the Web API. One way to think of a Web API is as a database exposed through the HTTP layer. Like any good database, we must be careful when designing "the DB model of our Web API". The equivalent of a formal DB model (an ERD for relational databases) is the Resource Modeling that goes into the Web API. Let’s dive right in.

In REST lingo, a URI endpoint is referred to as a Resource. From a data standpoint, one can think of it as a piece of data describing an entity instance. Here are some housekeeping rules to keep your Resources neat and clean (a short routing sketch follows the list):

  1. Avoid having your API start at the root domain. The two most common practices are to create a subdomain (such as http://api.mydomain.com), or to have a dedicated base URL different from the root domain (like http://mydomain.com/api). In doing so you ensure the longevity of the URIs assigned to your resource endpoints, and avoid potential collisions with Page URLs you may want to name in certain ways for the website hosted on your root domain. It also helps later on when versioning becomes an issue for a matured Web API.
  2. The resource endpoints should be plural and NOT singular, for example:
    • http://.../api/customers
    • http://.../api/orders
    • http://.../api/products
  3. Use identifiers to locate single elements in the resource URI. An identifier should ALWAYS resolve to the same single resource.
    • http://.../api/customers/123
    • http://.../api/orders/abc77uk
    • http://.../api/books/world-war-z
  4. When designing your resources, the golden rule is “Nouns are GOOD, Verbs are BAD”. Do not use verbs in ANY part of a resource URI.
    • If you see a verb in a URI, like http://.../api/getcustomers or http://.../api/payfororder, a part of you should die.
    • Do everything in your power to change it, and educate the creators why using verbs is a bad practice.
  5. For non-resource URIs (yes, they do exist) make sure they are CLEARLY functional endpoints.
    • A simple way to do this is to have a special URL path for all functions such as:
      • http://.../api/func/calculate-fees?size=5&weight=8
    • Please don't use non-resource URLs as an excuse to build an RPC-style API.
    • Other samples of functional endpoints could be:
      • http://.../api/func/calculateTax?state=fl&amount=10
      • http://.../api/func/randomNumber
      • http://.../api/func/getHash?input=ubniq2
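
To make the rules above concrete, here is a minimal sketch of what they might look like with ASP.NET Web API-style attribute routing. Everything here is hypothetical naming for illustration (the controllers, the fee formula, the placeholder payloads); the point is plural nouns, identifiers for single elements, and a dedicated func path for the few functional endpoints.

[sourcecode language="csharp"]
using System.Web.Http;

// Hypothetical controller illustrating the URI conventions above:
// plural resource names, identifiers for single elements, no verbs in the paths.
[RoutePrefix("api/customers")]
public class CustomersController : ApiController
{
    // GET http://.../api/customers
    [HttpGet, Route("")]
    public IHttpActionResult GetCustomers()
    {
        return Ok(new[] { "customer-1", "customer-2" }); // placeholder payload
    }

    // GET http://.../api/customers/123 — the identifier always resolves to the same resource
    [HttpGet, Route("{id}")]
    public IHttpActionResult GetCustomer(string id)
    {
        return Ok(new { Id = id });
    }
}

// Functional (non-resource) endpoints live under a dedicated "func" path,
// so they are clearly not resources: http://.../api/func/calculate-fees?size=5&weight=8
[RoutePrefix("api/func")]
public class FuncController : ApiController
{
    [HttpGet, Route("calculate-fees")]
    public IHttpActionResult CalculateFees(int size, int weight)
    {
        return Ok(size * 1.5m + weight * 0.75m); // made-up fee formula, for illustration only
    }
}
[/sourcecode]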

 

As I'm writing this post I cannot think of anything else to add, but this is a good place to start when it comes to the design of your Resource Model for your REST Web API.

Happy coding!


Using HTTP Verbs correctly in your REST Web API

Following up on the earlier post titled Using HTTP Status Codes correctly in your REST Web API, here is one on using the HTTP verbs from the W3C spec the "right way" to keep a clean REST Web API. Just like with the HTTP status codes, there are many more verbs in the HTTP standard; although they are in theory OK to use in your Web API, you can get by with just a few, which helps keep your API simple and self-explanatory to its clients.

The full list of HTTP verbs from the spec can be found HERE, but we are going to focus on how to interpret the verbs and the actions that should be executed for each one in the context of a well-defined REST Web API. The breakdown below also shows the result that standard clients expect when they make requests with each verb. To better understand their proper use, we'll use a sample resource endpoint called Users, where the (you guessed it) "Users" of our app are exposed via our Web API.

api/users
  • GET (aka read): gets the list of users. Returns: the list of users.
  • POST (aka insert): creates a user. Returns: the new user.
  • PUT (aka update): batch update. Returns: no payload, only an HTTP status code.
  • DELETE (aka delete): errors out. Returns: an error HTTP status code.
  • PATCH (aka partial update): batch update of the users, only with the attributes present in the request. Returns: no payload, only an HTTP status code.

api/users/123
  • GET (aka read): gets a single user. Returns: the single user.
  • POST (aka insert): errors out. Returns: an error HTTP status code.
  • PUT (aka update): updates the user. Returns: the updated user.
  • DELETE (aka delete): deletes the user. Returns: no payload, only an HTTP status code.
  • PATCH (aka partial update): partially updates the user, only with the attributes present in the request. Returns: the full updated user object.
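
As a rough illustration of that mapping, here is a minimal ASP.NET Web API-style sketch of a Users controller. The in-memory list stands in for whatever persistence you actually use, and the routes and model are made up for the example:

[sourcecode language="csharp"]
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Web.Http;

// Hypothetical model and controller, only to show which verb maps to which action and result.
public class User
{
    public int Id { get; set; }
    public string Name { get; set; }
}

[RoutePrefix("api/users")]
public class UsersController : ApiController
{
    // In-memory stand-in for a real data store.
    private static readonly List<User> Users = new List<User>();

    // GET api/users -> returns the list of users
    [HttpGet, Route("")]
    public IHttpActionResult GetUsers()
    {
        return Ok(Users);
    }

    // GET api/users/123 -> returns a single user
    [HttpGet, Route("{id:int}")]
    public IHttpActionResult GetUser(int id)
    {
        var user = Users.FirstOrDefault(u => u.Id == id);
        return user == null ? (IHttpActionResult)NotFound() : Ok(user);
    }

    // POST api/users -> creates a user, returns the new user (201 Created)
    [HttpPost, Route("")]
    public IHttpActionResult CreateUser(User user)
    {
        user.Id = Users.Count + 1;
        Users.Add(user);
        return Created("api/users/" + user.Id, user);
    }

    // PUT api/users/123 -> updates the user, returns the updated user
    [HttpPut, Route("{id:int}")]
    public IHttpActionResult UpdateUser(int id, User user)
    {
        var existing = Users.FirstOrDefault(u => u.Id == id);
        if (existing == null) return NotFound();
        existing.Name = user.Name;
        return Ok(existing);
    }

    // DELETE api/users/123 -> deletes the user, no payload, only a status code (204)
    [HttpDelete, Route("{id:int}")]
    public IHttpActionResult DeleteUser(int id)
    {
        Users.RemoveAll(u => u.Id == id);
        return StatusCode(HttpStatusCode.NoContent);
    }
}
[/sourcecode]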

 

And this is how you properly use the HTTP Verbs in a REST Web API.

Happy coding!


Using HTTP Status Codes correctly in your REST Web API

There are like a gazillion HTTP status codes maintained by the W3C and the Internet Assigned Numbers Authority (IANA) in their Official Registry for the HTTP specification. For RESTful Web APIs, even though in theory you could use any of them if the occasion deserves it, I've found that simplifying their use helps make your API self-documenting in nature and simplifies the cases your Web API clients need to consider. Here is my list of 'useful' HTTP status codes and how your clients can/should interpret them:

  • 200 OK: It worked!
  • 201 Created: The resource was created OK!
  • 304 Not Modified: The client can use the cached version of this resource, because nothing has changed.
  • 400 Bad Request: The client did something wrong. The request has bad syntax or cannot be fulfilled.
  • 401 Unauthorized: The Web API is asking the client to authenticate.
  • 403 Forbidden: The server understood the request, but is refusing to fulfill it due to restrictions in the client's authorization. Do not try again.
  • 404 Not Found: The resource was not found. There is nothing at that endpoint URI.
  • 500 Internal Server Error: The author of the service did something wrong. Something went bad on the server. (IOW: the Web API is fucked up.)
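
On the client side, a short status-code list keeps the handling logic small. Here is a simplified HttpClient sketch (the URL is a placeholder, and a real 304 would only show up if the request carried conditional caching headers):

[sourcecode language="csharp"]
using System;
using System.Net;
using System.Net.Http;

class ApiClientSample
{
    static void Main()
    {
        // Illustrative only: the URL is made up, and a real client would do more than print.
        using (var http = new HttpClient())
        {
            HttpResponseMessage response = http.GetAsync("http://api.mydomain.com/customers/123").Result;

            switch (response.StatusCode)
            {
                case HttpStatusCode.OK:                  // 200: it worked, read the payload
                    Console.WriteLine(response.Content.ReadAsStringAsync().Result);
                    break;
                case HttpStatusCode.NotModified:         // 304: nothing changed, use the cached copy
                    Console.WriteLine("Using cached resource.");
                    break;
                case HttpStatusCode.BadRequest:          // 400: we sent something wrong
                case HttpStatusCode.Unauthorized:        // 401: authenticate and retry
                case HttpStatusCode.Forbidden:           // 403: do not retry
                case HttpStatusCode.NotFound:            // 404: nothing at that URI
                    Console.WriteLine("Client-side problem: " + (int)response.StatusCode);
                    break;
                case HttpStatusCode.InternalServerError: // 500: the service blew up, not our fault
                    Console.WriteLine("Server-side problem, try again later.");
                    break;
            }
        }
    }
}
[/sourcecode]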

 

I always include a similar table on my API guidelines page (note I didn't say documentation, cuz a well-designed REST Web API should be self-documenting).

Happy API designing!


Library Oriented Architecture


Library Oriented Architecture may sound like yet another buzzword in the software arena, but one that is not properly documented as of yet. It is not a common term and certainly far from the widely popular SOA or Service Oriented Architecture. Since there is no formal definition of the term LOA, I’m going to take a stab at it:

“Library Oriented Architecture defines the methodology for creating software components in the form of reusable libraries exclusively constrained to a specific domain ontology.”

What does it mean? Well, I’m not going to drill too deeply into the ontology part; in a nutshell, “don’t confuse a contact with a user, they belong to different domain ontologies” (I wrote a different article about it HERE). In this piece we are going to drill down into the software side, the separation of concerns, and how to define a practical framework to create things in the right places.

I caught the term LOA for the first time at SuperConf 2012 in Miami. Richard Crowley came to the stage, threw the new term at the crowd, and got back a few long faces in return. In Richard’s own words, when referring to the Library-Oriented approach:

Package logical components of your application independently – literally as separate gems, eggs, RPMs, or whatever – and maintain them as internal open-source projects… This approach combats the tightly-coupled spaghetti so often lurking in big codebases by giving everything the Right Place in which to exist.

His talk was very solid and I recommend everyone with a hard-core-techie-heart to spare a few minutes on it. You can find his reflections about developing interoperability HERE.

It caught my attention just by the name, because I’ve been saying “It’s like SOA, but with libraries” for some time now. That phrase always came up when I was trying to explain an architectural pattern for building solid systems and frameworks. In general, LOA is just a way of thinking about software engineering. Library Oriented Architecture defines the structuring of libraries for domain ontologies, and it has 3 basic principles:

  1. A software library implementation and subject area expertise must be constrained to only 1 ontology domain.
  2. A software library that needs to use concepts and artifacts from a different ontology domain than the one it belongs to, must interface and reuse the library corresponding to that specific ontology domain.
  3. All domain specific software libraries must be maintained and supported with separate lifecycles.

Before we get into the weeds here, we ought to ask ourselves: Why in the world do we need a new term, or a new architecture, or a new anything in software engineering? Well, we don’t, but if you care to write badass apps and software systems that can evolve gracefully with time, this can turn out to be a very good road to take. For those who enjoy bullet points, here are some of the motivations to explore LOA a bit further:

  1. Simplify configuration management of distributed systems.
  2. Build highly reliable software systems because of the inherent properties of the LOA principles.
  3. Increase the Maintainability Index of your distributed systems and integration repositories.
  4. Minimize the risk of high coupling, especially for large systems (read Writing Elegant Code and the Maintainability Index).
  5. Bring developers up to speed orders of magnitude more quickly than a traditional system. Move developers and teams across libraries and domain ontologies and collaborate seamlessly.
  6. Spot bugs and zero-in on the problem almost instantly. There is something to be said about the amount of time a developer spends debugging.
  7. Maximize the Bus Factor of the software engineering team.
  8. Information systems built using LOA are technology-independent, and have the ability to replace entire libraries and domain implementations with localized impact and minimal upstream ripple effect.

Ok, enough reading, let’s see how this materializes in a diagram.

Library Oriented Architecture

Note that this is a specific implementation of Library Oriented Architecture for compiled libraries. You can adapt this to your own needs for scripted languages and even mix it around however you want. For the sake of simplicity, we’ll stick to this sample for now.

The second thing I want to note here is that the diagram is not describing how to implement LOA. It simply lays the foundations for a software engineering practice that happens to follow LOA principles. I’m sharing this because I think it is useful, and maybe someone will like it enough to offer some suggestions to improve it further.

I want you to notice a couple of things that are illustrated on the diagram:

  1. All 3 principles mentioned above are followed.
  2. The framework favors convention over configuration. Lib names, namespace naming and schema conventions are noted in the last column.
  3. You can clearly dissect the domains vertically and they span all the way from the data storage layer to the actual library implementing the domain specific logic.
  4. A library representing an ontology domain never interfaces with the data-sources, or even data access layer, from any other domain; instead it interfaces directly with the library representing that domain.
  5. Services are merely wrappers of libraries, with minimal or no business logic other than the orchestration of the libraries they need in order to fulfill their function (a thin-wrapper sketch follows this list).
    • This is important because services always tightly couple to their technology implementations and serialization mechanisms (WCF, ASMX, SOAP, REST, XML, etc.)
    • Part of the service implementation concern is usually dealing with this technology-specific fuzz that is unrelated to the actual business functionality the service is providing.
  6. Exception handling is bubbled up to the lib layer, so that we always get meaningful stack traces when debugging.
  7. Logging, as a cross cutting concern, should be manageable at all levels of the framework, however the domain deems necessary.
  8. If the implementations of the domain-specific libraries share a common framework, such as .NET or Java, they most likely share a common library set that extends each framework. For the example illustrated in the diagram, we called them framework infrastructure libraries, or Common Libs for short.
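
To illustrate point 5 above, here is a rough sketch of a service acting as a thin wrapper over a domain library. The names (GeoAddresses, GeoService) are hypothetical; the point is that the WCF plumbing lives in the service while all domain logic stays in the library:

[sourcecode language="csharp"]
using System.ServiceModel;

// Hypothetical domain library surface (lives in Geo.dll, the lib layer).
public static class GeoAddresses
{
    public static string Normalize(string rawAddress)
    {
        // Real normalization logic would live here, inside the GEO domain library.
        return rawAddress.Trim().ToUpperInvariant();
    }
}

// The service is merely a wrapper: it owns the WCF/serialization concerns
// and delegates all business logic to the domain library.
[ServiceContract]
public interface IGeoService
{
    [OperationContract]
    string NormalizeAddress(string rawAddress);
}

public class GeoService : IGeoService
{
    public string NormalizeAddress(string rawAddress)
    {
        return GeoAddresses.Normalize(rawAddress); // orchestration only, no domain logic here
    }
}
[/sourcecode]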

So, now that we have a framework for engineering our software needs, let’s see how to materialize it.

Suppose you are working on the next Foursquare, and it comes to the point where you need services that help you normalize addresses, and work with GIS and coordinates, and a bunch of other geo-location functions that your next-Foursquare needs.

It is hard sometimes to resist the temptation of the ‘just-do-it’ approach, where you ‘just’ create a static class in the same web app, change your Visual Studio web project to make an API call to 3rd party services, and start integrating directly to Google Maps, Bing Maps, etc. Then you ‘just’ add 5 or 6 app settings to your config file for those 3rd party services and boom, you are up and running. This approach is excellent for a POC, but it will not take you too far, and your app is not scalable to the point it could be with a Library Oriented approach.

Let’s see how we do it in LOA. In this world, it takes you maybe a couple of extra clicks, but once you get the hang of it, you can almost do it with your eyes closed.

  1. The Lib Layer
    1. Create a class library for the GEO domain ontology. Call it something like Geo.dll or YourCompany.Geo.dll. This library becomes part of your lib layer.
      • Deciding the boundaries of domain ontology is not an easy task. I recommend you just wing it at first and you’ll get better with time.
      • You need to read a lot about ontology to get an idea of the existential issues and mind-bending philosophical arguments that come out of it. If you feel so adventurous you can read about ontology HERE and HERE. It will help you understand the philosophical nature of reality and being, but this is certainly not necessary to move on. Common sense will do for now.
      • Just don’t go crazy with academia here and follow common sense. If you do, you may find later that you want to split your domain in two, and that is OK. Embrace the chaos and the entropy that comes out of engineering for scalability, it is part of the game.
    2. Define your APIs as methods of a static class, and add a simple [sourcecode language="csharp"]throw new NotImplementedException("TODO");[/sourcecode] body to each one (a fuller skeleton of this lib layer follows this list).
    3. Write your Unit Tests towards your APIs with your assertions (Test Driven Development practice comes handy here).
  2. The DAL Layer
    1. Sometimes your ontology domain does not need to store any data. If that is the case, skip to step 3, else continue reading.
    2. Create a new library for the GEO domain data access layer. Name it according to the convention you previously set up in your company and dev environment. For this example we’ll call it GeoDal.dll.
    3. Using your favorite technique, setup the data access classes, mappings and caching strategy.
      • If your persistent data store and your app require caching, this is the place to put it. I say if, because if you choose something like AWS Dynamo DB where 1 MB reads take between 1 and 10 milliseconds, maybe you want to skip cache altogether for your ‘Barbie Closet’ app :)
      • Memcached, APC, redis, AppFabric, your custom solution, whatever works for you here.
      • You can also use your favorite ORM (NHibernate, Entity Framework, etc.) and they already come with some level of caching on them.
      • Bottom line, LOA does not have any principle preventing you from going wild here, therefore your imagination and experience are the limit.
  3. The Data Layer
    1. For this exercise suppose we need to persist Addresses, Coordinates and Google Maps URLs.
    2. I suggest you scope your data entities by your domain ontology. A way we’ve found to work quite nicely is to use named schemas on RDBMSs and to set up namespace conventions for your NoSQL databases.
    3. For the GEO domain schema, we used SQL Server and created a named security schema called [Geo]. The use of named schemas makes it easy to avoid long table names, provides nice visual grouping of entities, and gives more granular security for your entities.
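
Putting the lib and DAL steps together, the skeleton referenced above might look something like the sketch below. All of the naming is hypothetical (YourCompany.Geo, GeoDal, the sample methods); what matters is the shape: a public static API stubbed with NotImplementedException, a DAL that only the GEO lib talks to, and a unit test written against the API before any implementation exists:

[sourcecode language="csharp"]
using System;

namespace YourCompany.Geo
{
    // Lib layer (Geo.dll): the public surface of the GEO domain ontology.
    public static class Addresses
    {
        public static string Normalize(string rawAddress)
        {
            throw new NotImplementedException("TODO");
        }

        public static double DistanceInKm(double lat1, double lon1, double lat2, double lon2)
        {
            throw new NotImplementedException("TODO");
        }
    }
}

namespace YourCompany.GeoDal
{
    // DAL layer (GeoDal.dll): only the GEO lib talks to this; other domains go through Geo.dll.
    public class AddressRepository
    {
        public void Save(string normalizedAddress)
        {
            throw new NotImplementedException("TODO: persist to the [Geo] schema");
        }
    }
}

namespace YourCompany.Geo.Tests
{
    // Unit test written against the API first (TDD); plain assertion to stay framework-agnostic.
    public class AddressesTests
    {
        public void Normalize_TrimsAndUppercases()
        {
            var result = Addresses.Normalize("  1 main st  ");
            if (result != "1 MAIN ST") throw new Exception("Assertion failed");
        }
    }
}
[/sourcecode]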

When it comes to data modeling, another technique I like to use is that of unaltered historical event data. Any ontology domain can be dissected into 3 purpose-specific data models: Configuration Data, Event Data, and Audit Data. They all serve very different purposes, and in general we like to keep them in separate schemas with separate security; this way we’re not commingling concerns. Each concern has a different DAL library, and potentially they all interface with the library representing the domain at the Lib Level. This post is already way too long, so I’ll try to cover some more data modeling strategies in future posts.

Now that we have a clearly separated domain library for our GEO domain, we can decide to wrap it with whatever technology-specific services we need. This is very convenient because when you want to move your SOA stack to a different technology, you don’t have to rewrite your entire domain infrastructure, only the service layer. More importantly, it allows for greater scalability, since it degrades gracefully and plays nicely with different frameworks and technologies. A well-implemented Library Oriented Architecture can be said to be technology-agnostic, and that makes it a great SOA enabler.

That’s it for this episode folks. Send me your comments or emails if you are using Library Oriented Architecture, or if you have any suggestions on how to improve the methodology or framework.

Happy coding!


The Myth of the Genius Programmer

http://www.youtube.com/watch?v=0SARbwvhupQ&list=PLCB5CF9838389D7F6&feature=view_all

From Google I/O 2009, here are Brian Fitzpatrick and Ben Collins-Sussman talking about the fears of programmers and the fear of looking 'stupid'.


Ontology... what?

Today in science, and especially in information technology, the word ontology is a hot topic. In short, an ontology is the specification of a concept. The idea has grown almost to the point of becoming a buzzword for academics and professionals in the computer science field, and yet a big part of the industry ignores the subject for lack of friendly documentation that describes, in plain terms, what it is, why it is important, and how it can change computing for the better.

The word appeared for the first time in the Oxford English Dictionary in 1989. Because it’s a relatively new word for English-speaking folks, the word itself gets in the way of the story it tells. In reality, the concept has been around for quite some time.

The philosophical study of existence, “what is real and what is not”, has been around for centuries. We can find evidence of the questioning of nature and reality all the way back to the Pre-Socratic era, with the philosopher Parmenides of Elea. Parmenides is best known for a poem he wrote called “On Nature” (read the poem here). The poem describes two different perspectives of the same reality, but it zeroes in on one powerful idea: that no matter how different the appearances of that ‘that it is’ (he calls this ‘the way of opinion’), the truth about ‘it’ does not change (‘the way of truth’). In a nutshell, this is the first recorded attempt to formalize the realization that existential things don’t change regardless of the lexicon or language used to describe them. Many more thinkers developed their own theses on how to define reality. Plato also made notable contributions to the field of ontology, and his disciple Aristotle put a dent in this universe with his works Categories and Metaphysics.

Why is this important today? Because all natural science fields that describe elements of the real world already have their own ontologies, but this is not the case for Computer Science and Information Technology. Physics, Chemistry and Biology all have a very clear lexicon or dictionary that describes their scientific domains. But we have yet to define an ontology that describes the world we present through software. When building information systems, different authors, developers and companies declare the same entity ‘that is’ not as the entity itself, but instead as one of its appearances. What we end up with is a lot of unnecessary repetition, corrupted data structures for entities, and unnecessary computations made for the sake of mapping appearances that represent the same entity. A call for a Global Ontology has been a topic for many academics for a long time, and it is in many ways considered the holy grail of information science.

Mathematics, as the universal language, describes abstractions and uses logical reasoning to determine the truthfulness of an assumption. We do it with the use of specialized notation, like numbers and shapes, that does not have a tangible form. No author, developer, company or human being on the planet will argue about what the number ‘3’ represents. Mathematics provides the foundation for the ontologies of any other domain definable by humanity. I couldn’t put it any better than Galileo Galilei:

The universe cannot be read until we have learned the language and become familiar with the characters in which it is written. It is written in mathematical language, and the letters are triangles, circles and other geometrical figures, without which means it is humanly impossible to comprehend a single word. Without these, one is wandering about in a dark labyrinth

Going back to Ontology in the Information Sciences, some questions remain unanswered:

  • What are the fundamental objects or structures we ought to define to represent the tangible and abstract concepts from a specific domain?
  • How can we successfully share and relate objects from different domain ontologies?
  • How can we define ontology structures in a way that they are effective for operational and usable digital communications?

The biggest challenge in information science with respect to the use of ontologies is that of establishing a baseline agreement in the industry to use a common lexicon and vocabulary consistent with the theory specified by a particular domain ontology. A Global Ontology would be defined as the aggregation of all domain ontologies, where a domain ontology represents the abstractions and tangible objects of a part of the world or a specific knowledge domain.

Competition begs to be mentioned in these lines. The mammoths of the software industry have shown more interest in sticking to their own discriminating structures than in sharing an ontological domain with their competitors. For example, Google Maps, Bing Maps and MapQuest all offer services in the GIS domain, yet they’ve decided not to share the same vocabulary and lexicon to name their GIS objects. Think about this for a minute: if these companies decided to share a global GIS schema, then their only real discriminator would be the quality of their service… but that would make it too easy for developers to switch sides, so they each put their own twist on a unique vocabulary. The result is arbitrary mappings between “State”, “Province”, “StateProvince” and “Municipality”, each with multiple data types, sizes and formats, ultimately adding layers of unnecessary complexity to such a simple concept as that ‘that it is’.

This is already too long a post, so I’ll cut it short. Maybe in future posts I’ll cover ontology closer to engineering, and what you, as an architect, computer scientist, programmer, etc., can do to make your work a much more pleasant and rewarding one. My very good friend Leonardo Lezcano has published many works on healthcare domain ontologies, with research and papers covering the Semantic Web and semantic interoperability. You can find some of his work HERE and HERE.

This is a somewhat challenging topic to explain, and a hard one for the recipient to say “I get it” to the first time around. I’ll feel good if I get an “I kinda got it” from someone after reading this :)


WCF WTF!

It gets me when application frameworks tamper with the core web concepts of precisely what they are trying to solve. If you have WCF services exposed through any of its different endpoints, you have to do the most ridiculous dancing to get something as simple as the HttpContext. WTF is up with that, Microsoft!?!

There are like 10 different ways to access HttpContext and Request Headers, all weird in their own ways, none of them standard, and requiring the callers to add headers in different and specific ways:

  • There is HttpContext (or this.Context or HttpContext.Current): “Gets or sets the HttpContext object for the current HTTP request.” This would be the obvious choice, but the WCF team needed to get COMPLICATED! To support this, you have to add extra magic and attributes to your service contracts (read here). A sketch of that incantation follows this list.
  • Then we get fancy with something that is not quite the HttpContext the WEB knows and loves, but some new BS called OperationContext (OperationContext.Current). MSDN explains: “Provides access to the execution context of a service method”... but of course!
  • There is also the HttpContextBase class, which according to MSDN “serves as the base class for classes that contain HTTP-specific information about an individual HTTP request”. So you’d think that HttpContextBase is the base class of HttpContext, right? WRONG!
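
For the record, here is roughly what that extra magic looks like. This is a hedged sketch from memory rather than a recommendation: the service is made up, and it assumes ASP.NET compatibility mode is also switched on in web.config via <serviceHostingEnvironment aspNetCompatibilityEnabled="true" />:

[sourcecode language="csharp"]
using System.ServiceModel;
using System.ServiceModel.Activation;
using System.Web;

// Opting the service into ASP.NET compatibility so HttpContext.Current gets populated.
[ServiceContract]
public interface IMyService
{
    [OperationContract]
    string WhoAmI();
}

[AspNetCompatibilityRequirements(RequirementsMode = AspNetCompatibilityRequirementsMode.Required)]
public class MyService : IMyService
{
    public string WhoAmI()
    {
        // Only works when aspNetCompatibilityEnabled="true" is set in web.config.
        HttpContext context = HttpContext.Current;
        return context != null ? context.Request.UserAgent : "No HttpContext for you!";
    }
}
[/sourcecode]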

Hmmm, at this point you might think this is a brain teaser. There may be another 2-3 ways to access data from similar concepts. If inspecting the HttpContext on the server side is a nightmare, managing headers and contextual HTTP request elements on the client is even worse if your client is using the WCF contracts generated by VS. Here you are either setting something called ‘OutgoingMessageHeaders’ on an HTTP request (like there is something that can be ‘incoming’ during a request), or you are implementing a custom IClientMessageInspector and altering the request before it is sent to the server. What is this, the Police Academy (Inspector, pffff)? Why do I need to inspect a message I built? Or why am I forced to do this kind of crap?

This is so frustrating that I cannot cope with the unnecessary layers of engineering and noise the WCF team threw on top of such a simple concept. I have nothing against new and different ways to solve problems, but please don’t call it the same as something that already exists and is well defined by the HTTP protocol specification (RFC 2616). PLEASE. DON'T.

I’ll try working around it with Rick Strahl’s post. If I keep having problems, I’ll move out to a different framework, implement my IHttpHandler, or downplay WCF’s capabilities.


SQL Convert CSV to Table. The Better Way.

A very common database operation is to transform a simple CSV string into a table so it can be used in more complex query operations. There is an overabundance of SQL functions online that perform such transformations. Here is one:

[sourcecode language="sql"]
CREATE FUNCTION [dbo].[fn_ParseCsvString]
(
    @CSVString VARCHAR(MAX),
    @Delimiter VARCHAR(1)
)
RETURNS @tbl TABLE ( s VARCHAR(128) )
AS
BEGIN
    DECLARE @i INT, @j INT
    SELECT @i = 1
    -- walk the string while the current index is still inside it
    WHILE @i <= LEN(@CSVString)
    BEGIN
        SELECT @j = CHARINDEX(@Delimiter, @CSVString, @i)
        IF @j = 0
        BEGIN
            SELECT @j = LEN(@CSVString) + 1
        END
        INSERT @tbl SELECT SUBSTRING(@CSVString, @i, @j - @i)
        SELECT @i = @j + LEN(@Delimiter)
    END
    RETURN
END
[/sourcecode]

And you’ll use it like this:

[sourcecode language="sql"] select * from dbo.fn_ParseCSVString ('1,2,3,4,5,6,7,8,9,10,11,12,13,14,15', ',') [/sourcecode]

The problem is the way we use those functions and the fact that their performance is linear in the number of comma-separated values in the string. That works out OK for a small number of items, but not when you want to JOIN on a very large CSV. When you use a CSV function in a sproc or a query, you end up writing JOIN statements against the result set from the function in order to filter a real table in your database.

Let's look at a sample scenario: suppose you have an Address table in your Geo schema. Your Address table has a primary key 'Id' column and other columns that are irrelevant for the purpose of this article. If you wanted to query your Address table based on CSV values using your function, you'd do something like this:

[sourcecode language="sql"]

DECLARE @csv VARCHAR(MAX) = '1,2,3,4,5,6,7'

SELECT a.Id FROM Geo.[Address] a JOIN dbo.fn_ParseCsvString(@csv, ',') tbl ON a.Id = tbl.s [/sourcecode]

Here is some benchmarking on the previous query (all in microseconds): 1 second = 1000 milliseconds = 1'000'000 microseconds

1000 CSV  ~ 0.5 secs – T1: 557'272 – T2: 408'185 – T3: 446'485

5000 CSV  ~ 2.3 secs  – T1: 2'376'159 – T2: 2'307'394 – T3: 2'370'191

10000 CSV ~ 4.6 secs  – T1: 4'699'332 – T2: 4'737'877 – T3: 4'641'049

You get the idea, and it gets even worse when you do LEFT JOINs on the CSV. Even if your tables are indexed on the actual columns you are joining to your function, sequentially calculating those values and inserting them into a table takes unnecessary time. So here is a solution that uses not a function but a stored procedure, and the reason for the stored procedure is the need to run dynamic SQL.

[sourcecode language="sql"]

PROCEDURE [dbo].[CsvToTable] @Csv VARCHAR(MAX) , @SchemaName VARCHAR(128) , @TableName VARCHAR(128) , @KeyColumnName VARCHAR(32) = 'Id' , @Debug BIT = 0 AS SET NOCOUNT ON ; BEGIN TRY DECLARE @mySql NVARCHAR(MAX)= 'select distinct ' + @KeyColumnName + ' from ' + QUOTENAME(@schemaName) + '.' + QUOTENAME(@tableName)

IF ( @Csv IS NULL OR @Csv = '' ) BEGIN SET @mySql = @mySql + ' where 1=0' END ELSE BEGIN

SET @mySql = @mySql + ' where ' + @KeyColumnName + ' in (' + @Csv + ')' END IF ( @Debug = 1 ) PRINT @mySql EXEC sp_executesql @mySql END TRY BEGIN CATCH --Your catch logic here --EXEC dbo.Error_Rethrow END CATCH

[/sourcecode]

And this is how you’d use it:

[sourcecode language="sql"] DECLARE @csv VARCHAR(MAX) = '1,2,3,4,5,6,7'

DECLARE @tbl TABLE ( Id INT ) INSERT INTO @tbl EXEC dbo.CsvToTable @csv, N'Geo', N'Address', N'Id'

SELECT a.Id FROM Geo.Address a JOIN @tbl tbl ON tbl.Id = a.Id [/sourcecode]

Here is some benchmarking on the new solution (all in microseconds): 1 second = 1000 milliseconds = 1'000'000 microseconds

10000 CSV ~ 0.2 secs – T1: 212'289 – T2: 183'635 – T3: 204'688

WOW! That's  1/5 of a second on average to query 10K records. MUCH MUCH better ah?!

Let’s check pros and cons of this solution.

Pros:

  1. There is nothing faster than this solution to convert CSV into a table. Period.
  2. You can configure additional indexes on the columns you want to look up using the CSV approach, and those indexes will have an impact on the speed of the CSV look-up; versus always having a linear execution order, because the function is unaware of indexes.
  3. Simple to use.

Cons:

  1. It’s using dynamic SQL, and that brings all the negative things writing dynamic SQL has. I recommend having a dedicated role for the callers of this sproc so that regular users do not have permission to run it directly. Try to always use this sproc through another sproc.
  2. The fact that it is a sproc “feels” wrong, since by their nature sprocs can have side effects, although this one has none. This is precisely the reason why a function does not allow execution of dynamic SQL.

I think I can live with the feeling in trade for the unmatched performance this solution offers. A couple of things to note that were done to prevent SQL injection in the dynamic SQL:

  1. Both schema and table names are provided as separate parameters of the sproc, and they are put together in the code with the '.' separator.
  2. Both schema and table names are wrapped using QUOTENAME().
  3. When executing the dynamic SQL, I'm using sp_executesql rather than just the plain EXEC() function.
  4. There is a @Debug parameter that prints the generated dynamic SQL when debugging. This will save you a couple of headaches.


Polymorphism... back to school

Almost anyone with some notion of programming or object-oriented techniques will be able to describe what polymorphism is and how to use it; I've discovered recently that very few can tell you how it actually works. While this may not be relevant to the daily work of many software developers, the consequences of not knowing, and thus its misuse, affect everyone. We all know how to describe a car and even how to use it, but only mechanics know how to fix it, because they know how IT WORKS.

.

.

Polymorphism: What it is

Polymorphism is a characteristic or feature of programming languages. Programming languages either support it or they don’t. Since programming languages fall under the umbrella of sometimes substantially different paradigms, in this article I’m going to concentrate on polymorphism within the scope of Object Oriented Programming.

In OOP, polymorphism is considered one of the basic principles, and a very distinctive one. Most object-oriented languages have polymorphism among their many features. In a nutshell, polymorphism is best seen when the caller of an object with a polymorphic implementation is not aware of the exact type the object is. Polymorphism interacts very closely with other features like inheritance and abstraction.

.

Polymorphism: What it is NOT

Contrary to an overwhelming number of articles found online, things like overloading polymorphism or parametric polymorphism are NOT object-oriented expressions of polymorphism; they are applicable to functional (declarative) programming languages. Signature-based polymorphism, overloading polymorphism, parasitic polymorphism and polymorphism in closures are meaningful in functional languages only (or at best, in multi-paradigm languages with functional expressions and/or duck-typing languages, like Python or JavaScript).

Polymorphism is not boxing or unboxing, and it is not the same as inheritance; instead, polymorphic manifestations usually occur as a consequence of inheritance, sub-typing and boxing.

.

Polymorphism: How it works

As its name suggests, polymorphism is the ability of an object to have more than one form, or more properly, to appear as if it were of some other type. The most common manifestation can be seen with sub-typing; let’s see it through an example:

[sourcecode language="csharp"] public class Fruit { public virtual string Description() { return "This is a fruit"; } }

public class FreshApple: Fruit { public override string Description() { return "Fresh Apple"; } }

public class RottenBanana:Fruit { public override string Description() { return "Banana after 10 days"; } } [/sourcecode]

Now we can do something like this:

[sourcecode language="csharp"] Fruit[] fruits = { new FreshApple(), new RottenBanana(), new Fruit() }; foreach (var fruit in fruits) { Console.WriteLine("Fruit -> {0}", fruit.Description()); } Console.ReadLine(); [/sourcecode]

... and that would have the following output:

Fruit -> Fresh Apple
Fruit -> Banana after 10 days
Fruit -> This is a fruit

That’s all well and good, and almost anyone will get this far. The question is HOW does this happen, HOW does the runtime realize that the method it must call is the one in the child classes instead of the one in the parent class? How deep does this rabbit hole go? There is no magic element here, and you’ll see exactly HOW this occurs and WHY it is important to know about it.

These mappings of function calls to their implementations happen through a mechanism called dispatch tables or virtual tables (vTables for short). Virtual tables are a feature of programming languages (again, they either support it or they don’t). Virtual tables are present mostly in languages that support dynamic dispatch (C++, C#, Objective-C, Java), meaning they can bind to function pointers or objects at run-time, and they can abstract away the actual implementation of the method/function/object until the very moment it is going to be used.

The vTable itself is nothing more than a data structure, no different from a Stack or a Hashtable in that respect; it just happens to be the one used for the dynamic dispatch feature in programming languages. Some languages offer a different structure, the binary tree dispatch, as an alternative to vTables. As with any data structures, both have pros and cons when it comes to measuring performance and cost. The point is that whether it is a vTable or a bTree dispatch, dynamic dispatching is bound using a data structure that is carried over with objects and functions to support this feature. Yes, I said “carried over”.

vTables are a hidden variable of objects. In .NET, for example, all classes and structs inherit from Object, and Object already has virtual functions like "ToString()" and "GetHashCode()", so every single object you create will have a hidden private vTable structure that is initialized behind the scenes in the constructor call of every object. The purpose of these vTables being created all over the place is exactly to map the correct functions for polymorphic objects and functions. You can use Reflector to peek at an object at runtime (IL) and you'll find its vTable in the first memory location of each object's memory segment. The IL instructions call and callvirt (polymorphism, yay!) are used to call non-virtual and virtual methods respectively. There is simply no way of accessing an object's vTable directly from C#; this was done on purpose to minimize the security implications, among other reasons. In C++, hmm...

[sourcecode language="cpp"] long *vptr = (long *) &obj; long *vtable = (long *)*vptr; [/sourcecode]

So the vTable of each object will hold a reference to each one of its virtual functions. Think of it, structurally, as a list of function pointers sorted in the same order they are declared. This serves its purpose in inheritance and polymorphism: when a child object gets initialized, regardless of being assigned to a variable of the parent type, the vTable of that object will be the one corresponding to the actual inheritor. When virtual functions get called on the variable, the correct function in the child object (the actual object type) is found thanks to the function pointers in its vTable. A toy sketch of the idea follows.
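
To make the list-of-function-pointers idea tangible, here is a toy sketch that fakes a dispatch table with delegates. This is NOT what the CLR actually does internally, just an illustration of how the same call site ends up in different implementations depending on the table the object carries:

[sourcecode language="csharp"]
using System;
using System.Collections.Generic;

// A fake "vTable": slot name -> function pointer (delegate).
class FakeVTable
{
    private readonly Dictionary<string, Func<string>> slots = new Dictionary<string, Func<string>>();
    public void SetSlot(string name, Func<string> impl) { slots[name] = impl; }
    public string Invoke(string name) { return slots[name](); }
}

class Program
{
    static void Main()
    {
        // "Base" table: the Description slot points to the base implementation.
        var fruitTable = new FakeVTable();
        fruitTable.SetSlot("Description", () => "This is a fruit");

        // "Derived" table: same slot, overridden to point to the derived implementation.
        var freshAppleTable = new FakeVTable();
        freshAppleTable.SetSlot("Description", () => "Fresh Apple");

        // The caller only knows the slot name; which function runs depends on
        // which table the object carries, which is exactly how virtual dispatch behaves.
        foreach (var table in new[] { freshAppleTable, fruitTable })
        {
            Console.WriteLine("Fruit -> {0}", table.Invoke("Description"));
        }
    }
}
[/sourcecode]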

Virtual tables, inheritance and polymorphism in general have been criticized for their overhead in program execution. That is one reason why one of the object-oriented programming principles is "Favor Object Composition Over Inheritance". Non-virtual functions never require the overhead of vTables, making them naturally faster than virtual functions. Applications and systems with a deep inheritance architecture tend to spend a considerable amount of their execution time just trying to figure out the resolution path of their virtual functions. This can become quite a performance tax if used without care, especially on older CPU architectures. Most modern compilers have a trick or two up their sleeves to attempt to resolve virtual function calls at compile time without the need for vTables, but dynamic binding will always need these creatures to resolve its destination paths.

And now you know how polymorphism works. Happy coding!


SQL - Select rows in chaotic order

Ok, another SQL trick. How do you get a row set where the rows are ordered by a specific value sequence of a particular column?

This is not a very common problem, but if you happen to need something like this, you may scratch your head for a couple of minutes before you figure it out.

So, the challenge is the following, you have a table like the one below:

 

If you want to return all rows with ids 4, 5, 7 and 8  in order by any column, you can do so easily using order by [the_column] asc or desc; but if you want to return those same rows in a chaotic desired order is no so trivial.

Say we want the rows with ids 4, 5, 7 and 8, and we want them such that the ids appear in the following order: 7, 4, 8 and then 5.

Before looking at how to do it, challenge yourself and try to come up with that query.

Ok, now let’s look at the solution. It involves the use of the CHARINDEX string function. CHARINDEX returns the starting index of parameter1 in the string given in parameter2, if parameter1 is a substring of parameter2. For example:

[sourcecode language="sql"] SELECT CHARINDEX ( '23', 'AA23BB') --returns 3 SELECT CHARINDEX ( 'abc', 'aapppabcbc') --returns 6 [/sourcecode]

That means we can do something similar with a set of numbers when treated as strings, like so:

[sourcecode language="sql"] SELECT CHARINDEX ( '7', '7,4,8,5') --returns 1 SELECT CHARINDEX ( '4', '7,4,8,5') --returns 3 SELECT CHARINDEX ( '8', '7,4,8,5') --returns 5 SELECT CHARINDEX ( '5', '7,4,8,5') --returns 7 [/sourcecode]

Now you see how easy it is to get an ordering rank using this style. Finally, we can create our query using this technique and return our rows in the order we want.

[sourcecode language="sql"] select * from Person where PersonID in (4,5,7,8) order by CHARINDEX(CONVERT(varchar, Person.PersonID), '7,4,8,5') [/sourcecode]

Happy coding!
