Windows Management and Scripting

A wealth of tutorials on Windows operating systems, SQL Server and Azure


Posts Tagged ‘Data management’

SQL Server Master Data Services retooled

Posted by Alin D on October 25, 2011

Master Data Services (MDS), released with SQL Server 2008 R2 Enterprise Edition, is Microsoft’s offering for companies grappling with mounting stacks of all-important master data. The aim, as with all master data management (MDM) products, is to ensure the accuracy of information critical for strategic business decisions. But with users saying the interface is clunky and performance poor, whether MDS is meeting the challenge is up for debate.

Master data can provide companies with valuable information, including details about products, customers, processes or any other component essential to doing business. MDM systems capture and organize that data in an effort to synchronize analytical and operational systems, enforce compliance requirements and establish business processes that are auditable and repeatable.

According to Kirk Haselden, product manager at Microsoft, the goal of MDS is to provide organizations a way to consolidate domains of master data so downstream business intelligence (BI), analytical and line-of-business systems can consume the information.

However, most users find MDS lacking in a number of areas, according to John Radcliffe, research vice president at Gartner Inc., an analyst firm in Stamford, Conn. “It needs merge facilities, better data quality tooling, data stewardship applications, better UIs [user interfaces] for business users, better workflows for authoring and managing tasks.” Radcliffe also said the product must demonstrate scalability, availability and marketability in operational environments that are increasingly demanding.

Master Data Services: Innovation or regurgitation?


Mark Kromer, a SQL Server technology specialist for Microsoft, shared some of these concerns, saying that the initial release of MDS was mostly a repackaging of EDM+, an MDM system developed by software vendor Stratature. Microsoft acquired Stratature in 2007 as part of its strategy to deliver a unified MDM system that included SQL Server, Office SharePoint and other BI-related technologies.

Kromer said a primary reason to implement MDS is to provide data governance over an organization’s assets, enabling businesses to build data domains based on their own business models. Within these domains, businesses can create reliable, centralized data stores for analyzing their master data in order to create “a single version of the truth.”

That’s easier said than done. As Radcliffe pointed out, the biggest challenge with MDM products is not the technology, “but getting agreement on the necessary governance and organizational roles.” A product can do only so much without internal consensus, he said.

Yet even with a consensus, effective tools are needed to control the methods and processes used by data stewards as part of that data governance. Haselden said MDS addresses some of the key challenges around managing master data, and it solves the problem of how to grant secure access to the people who need it. “In the early days of the product,” Haselden said, “the team called this the human factor—basically, delivering the data to the business users in their context.”

But, Haselden said, the current release of MDS is not without its problems. For example, in feedback to Microsoft, users have reported that the UI is difficult to navigate. “Improved functionality is a priority,” he said. “We know about the issues and are working to improve them.”

Looking ahead with Master Data Services


The next version of MDS, which will be released with the upcoming SQL Server 2012, formerly called Denali, was designed to address many of those issues. According to Kromer, the new version will demonstrate that Microsoft has heard what users have been saying: The Web UI will be more intuitive and better geared toward business users. Page navigation will be improved through the use of the Microsoft application framework Silverlight. The workflow will be enhanced. Data staging will be simplified. And the product will be better integrated with SQL Server’s BI tools.

“The Denali release of MDS starts to make this a much more integrated ecosystem with the SQL Server product suite,” Kromer said, “which I think will go a long way in increasing the usefulness of MDS and the customer adoption rate.”

Microsoft released the revamped MDS features in a community technology preview (CTP) of Denali in July, and users who’ve worked with the new version report that they’re seeing improvements in the product. According to feedback, performance is better and the UI is more user-friendly, partly because it has been rebuilt in Silverlight.

Perhaps the most popular MDS feature in the Denali CTP is a new add-in for Excel. Haselden said Excel integration will further address concerns about the UI, improving the look and feel “to the extent that most business users will use Excel as the preferred experience.”

Radcliffe was not so sure. Although he agreed Excel might be beneficial to people accustomed to the program, he said it should be only one of several UIs offered in MDS. “For example, many competitive MDM products have graphical UIs that allow you to visualize, transverse and manipulate hierarchies,” he said. “I’m not sure that Excel is the best tool for that. And on the data stewardship side, again, people are looking for very visual analysis and reporting tools.”

Microsoft might not have any choice but to deliver the tools it has, given the number of MDM products crowding the arena, such as SAP NetWeaver MDM, Oracle Master Data Management Suite, IBM InfoSphere Master Data Management Server and Informatica MDM. As Radcliffe points out, MDS comes at a good price — it’s free — and that’s definitely worth considering. Even so, he said, MDS “is still a long way behind the MDM competition.”

A peek inside Master Data Services

MDS organizes master data into models. Each model contains a single domain of data—that is, data related in a meaningful way, such as product or customer information. Within each model, you can configure a set of objects that correspond to objects in the MDS database:

  • Entity. A set of related data that corresponds with a database table.
  • Attribute. An entity property that corresponds to a table’s column.
  • Domain-based attribute. An attribute linked to another entity’s attribute (often corresponding to a column’s foreign key).
  • Attribute group. A collection of attributes used to group data in the MDS interface.
  • Member. An entity record that corresponds to a row in the database table. Leaf and consolidated members support explicit hierarchies. A leaf member is at the lowest level of the hierarchical structure. A consolidated member can serve as parent to a leaf member or another consolidated member. A collection member is part of a collection.
  • Attribute value. A specific data value in a member (corresponds to a row’s value in a database table).
  • Hierarchy. The hierarchical organization of members. Derived hierarchies are based on the relationships established by domain-based attributes. Explicit hierarchies are groups of data organized according to a user’s requirements.
  • Collection. A user-defined subset of members.
  • Business rule. Business logic stored as a SQL expression and applied to imported data.
  • Subscription view. A view created specifically for systems accessing the master data.

Think of MDS as a front-end application that lets you store and manage data within a SQL Server database. The application automatically takes care of table structures, relationships, constraints, permissions and other database components, letting you focus on the data itself.
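For example, downstream BI or line-of-business systems normally consume the master data through subscription views, which MDS exposes as ordinary views in its SQL Server database. As a minimal sketch, assuming a hypothetical subscription view named mdm.ProductLeaf created for a Product entity (the schema, view and column names shown here are illustrative and depend on how the model and the view were set up), a load process could simply query it:

SELECT
    Code,           -- member code (the business key)
    Name,           -- member name
    Category_Code,  -- a domain-based attribute exposed by the view
    LastChgDateTime
FROM mdm.ProductLeaf
WHERE Category_Code = 'MTB';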

Posted in SQL

SQL Azure data synchronization

Posted by Alin D on January 13, 2011

In my previous article, “SQL Azure migration: To the cloud and back again,” I discussed the options for moving data between local SQL Server instances and SQL Azure. In this article, we will look at more complex data exchange scenarios, including data synchronization and refreshing data in SQL Azure while maintaining availability.

Implementing data synchronization typically requires some up-front analysis to determine the best process and most suitable tools and technologies for the job. Among other things, you need to consider the number of tables to synchronize, required refresh frequency (this could differ greatly among tables in the same database), application uptime requirements and size of the tables. In general, the larger the tables are and the higher the required uptime, the more work is required on your part to implement data synchronization so that it doesn’t interfere with the applications using the database.

One of the easiest approaches to data synchronization is to create staging tables in the destination database and load them with data from the source database. In SQL Azure, do this using SQL Server Integration Services or the bcp utility, as discussed in the previous article. Once the data is in staging tables, run a series of T-SQL statements to compare the data between the staging and “master” tables and get them in sync. Here is a simple sequence I’ve been using successfully on many projects (a T-SQL sketch follows the list):

  • Use the DELETE FROM command to join the staging and master tables and delete all rows from the master table that have no match in the staging table.
  • Use the UPDATE FROM command to join the staging and master tables and update the matching records in the master table.
  • Use the INSERT command to insert into the master table the rows that exist only in the staging table.
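As a minimal sketch, assuming hypothetical staging and master tables stg.Customers and dbo.Customers keyed on CustomerID (your table, key and column names will differ), the three steps could look like this:

-- 1. Delete master rows that no longer exist in the staging table
DELETE m
FROM dbo.Customers AS m
LEFT JOIN stg.Customers AS s ON s.CustomerID = m.CustomerID
WHERE s.CustomerID IS NULL;

-- 2. Update master rows that have a match in the staging table
UPDATE m
SET m.CustomerName = s.CustomerName,
    m.City = s.City
FROM dbo.Customers AS m
JOIN stg.Customers AS s ON s.CustomerID = m.CustomerID;

-- 3. Insert staging rows that are missing from the master table
INSERT INTO dbo.Customers (CustomerID, CustomerName, City)
SELECT s.CustomerID, s.CustomerName, s.City
FROM stg.Customers AS s
LEFT JOIN dbo.Customers AS m ON m.CustomerID = s.CustomerID
WHERE m.CustomerID IS NULL;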

If you are using SQL Server 2008 or newer, you can use the MERGE statement to combine the second and third steps into a single command. The MERGE statement is nicknamed UPSERT because it combines the ability to insert new rows and update existing rows in a single statement, so it lends itself nicely to data synchronization.
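A sketch of the same synchronization expressed as a MERGE, again using the hypothetical stg.Customers and dbo.Customers tables; note that an optional WHEN NOT MATCHED BY SOURCE clause can take over the delete step as well:

MERGE dbo.Customers AS m
USING stg.Customers AS s
    ON m.CustomerID = s.CustomerID
WHEN MATCHED THEN
    UPDATE SET m.CustomerName = s.CustomerName,
               m.City = s.City
WHEN NOT MATCHED BY TARGET THEN
    INSERT (CustomerID, CustomerName, City)
    VALUES (s.CustomerID, s.CustomerName, s.City)
WHEN NOT MATCHED BY SOURCE THEN
    DELETE;  -- optional: replaces the separate DELETE step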

The technique I described works well mainly for small to medium-sized tables, because the master table will be temporarily locked and therefore unavailable while these updates are taking place. In my experience, that is a minor disruption which can be mitigated by synchronizing during times of low database usage or by breaking up the updates into batches. The disadvantage here is that you have to implement custom code for each table.
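Batching is easy to sketch; for instance, the delete step shown above could be run in chunks so that each statement holds its locks only briefly (the 5,000-row batch size is arbitrary and worth tuning for your own tables):

-- Delete in batches to keep locks short-lived
WHILE 1 = 1
BEGIN
    DELETE TOP (5000) m
    FROM dbo.Customers AS m
    LEFT JOIN stg.Customers AS s ON s.CustomerID = m.CustomerID
    WHERE s.CustomerID IS NULL;

    IF @@ROWCOUNT = 0 BREAK;
END;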

I’ve also had success with a slightly modified approach, which takes away the need to implement a set of scripts for each table. After you load the staging tables, execute the sp_rename stored procedure and swap the names of the master and the staging table. This can be done very quickly, even on many tables. Run each table swap inside of a TRY/CATCH block and roll back the transaction if the swap does not succeed.
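A minimal sketch of such a swap, again using hypothetical dbo.Customers and dbo.Customers_Staging tables:

BEGIN TRY
    BEGIN TRANSACTION;

    -- Swap the freshly loaded staging table in for the master table
    EXEC sp_rename 'dbo.Customers', 'Customers_Old';
    EXEC sp_rename 'dbo.Customers_Staging', 'Customers';
    EXEC sp_rename 'dbo.Customers_Old', 'Customers_Staging';

    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;

    -- Surface the error so the load process knows the swap failed
    DECLARE @msg nvarchar(2048) = ERROR_MESSAGE();
    RAISERROR(@msg, 16, 1);
END CATCH;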

Another technique many companies use to refresh data and at the same time keep it available is to simply maintain two copies of the database. One database is used by applications and the other one is used by the load process to truncate and reload tables. Once the load is done, you can rename both databases and swap them so that the one with the fresh data becomes the current database. SQL Azure initially didn’t support renaming of databases, but that feature works now. As an alternative to renaming, store a connection string pointing to a SQL Azure database in a local database and have your data load process modify the connection string to point to the refreshed database, provided, of course, that the data load completed successfully.
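As a sketch of the connection-string approach, assuming a hypothetical configuration table in the local database that applications read before connecting (the table, server and database names are illustrative):

-- One row per logical database the applications connect to
CREATE TABLE dbo.ConnectionConfig (
    ConfigName varchar(50) NOT NULL PRIMARY KEY,
    ConnectionString nvarchar(500) NOT NULL
);

-- At the end of a successful load, repoint the applications
-- at the freshly refreshed SQL Azure database
UPDATE dbo.ConnectionConfig
SET ConnectionString = N'Server=tcp:yourserver.database.windows.net;Database=SalesDB_Refreshed;'
WHERE ConfigName = 'SalesDB';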

Another option is to use the Microsoft Sync Framework, a platform for synchronizing databases, files, folders and other items. It allows you to programmatically synchronize databases via ADO.NET providers, and as of version 2.1 you can use the framework to synchronize between SQL Server and SQL Azure. Describing all the features and capabilities of the Sync Framework is beyond the scope of this article; for more information, visit the Microsoft Sync Framework Developer Center. One of the advantages of the framework is that once you get up to speed with the basics, you can write applications that give you full control over SQL Azure data synchronization. Among other things, you will be able to utilize its features such as custom conflict detection and resolution, and change tracking. Both of those features come in handy if you need to implement bidirectional data synchronization.

Microsoft developers used the Sync Framework to develop and release an application called Sync Framework Power Pack for SQL Azure. You can download and install this application, but first install the Microsoft Sync Framework SDK. The application runs as a wizard. After you specify the local and the SQL Azure database, select the tables you want to synchronize. You can also specify how you want to handle the situation in which the same row is updated in both databases. Figure 1 illustrates how you can choose whether the local database or SQL Azure database wins the resolution.

Figure 1: Choosing whether the local database or the SQL Azure database wins conflict resolution.

In the last step of the wizard, specify whether it should create a 1 GB or a 10 GB database in SQL Azure. The tool creates the specified database in SQL Azure and sets up both databases with objects needed for synchronization. It will create INSERT/UPDATE/DELETE triggers on each synchronized table. Also, for each table it will create another table with the “_tracking” suffix. It also creates a couple of database configuration tables called scope_config and scope_info. As data gets modified in either database, the triggers update the tracking tables with the details that will be used by the Sync Framework when it’s time to synchronize.

The wizard also creates a SQL Agent job that kicks off the sync executable passing the appropriate parameters. All you need to do is schedule the job to synchronize as often as needed. The tool is not super fast, but it works fairly well and, in many cases, it can handle your synchronization requirements. The biggest drawback is that when you run the wizard, it insists on creating a new SQL Azure database, and it fails if you specify an existing database. So, if you ever want to modify what tables should be synchronized, drop the Azure database and start over.

Posted in Azure

Introduction to Table Storage in Windows and SQL Azure

Posted by Alin D on December 15, 2010

In addition to SQL Azure, Windows Azure features four persistent storage forms: Tables, Queues, Blobs and Drives. In this article we will focus on Tables.

Table Storage

Tables are a very interesting new storage method offered in Windows Azure and are Microsoft’s answer to Amazon SimpleDB. The SQL Azure database offers the wealth of features a modern database might be expected to provide; however, for many purposes it is overkill. In the past, almost any structured data had to go into the database and incur the performance penalty that entailed. With Azure Tables, data that has a relatively simple structure can avoid that overhead (below we use the example of a Movies dataset, a listing of movies with attributes such as title, category and date, which is not a very complex dataset).

Tables offer structured data storage for data that has relatively simple relationships. Data is stored in rows, and because tables are less structured and don’t carry the overhead of a full database, they are massively scalable and offer very high performance. The interface to Azure Tables is the familiar .NET suite of classes, LINQ, and REST.

To create an Azure Table, first create a storage service in the Windows Azure Developer Portal, then create a storage account and from there create tables. Each table is scoped to its storage account, so tables with the same name can coexist as long as they are scoped to different storage accounts.

Table Data Model

Tables are composed of rows and columns. For the purposes of the Azure Table Data Model, rows are entities and columns are properties. For an entity, a set of properties can be defined, but three properties are mandatory: PartitionKey, RowKey and Timestamp. PartitionKey and RowKey can be thought of as a clustered index that uniquely identifies an entity and defines the sort order. Timestamp is a read-only property.

Partitions

Table partitions can be thought of as units of scale within Windows Azure that are used for load balancing. Tables are partitioned based on the PartitionKey; all entities with the same PartitionKey are served by a single server. Therefore, the selection of an appropriate PartitionKey is central to achieving scalability and high throughput on Windows Azure. It is important to note that Azure throttles an account when its resource utilization is very high; appropriate partitioning greatly reduces the potential for this happening by allowing the load to be distributed over different servers. The RowKey provides uniqueness within a single partition.
Partitions can be thought of as higher-level categories for the data, with RowKeys as the lower-level details. For example, for a ‘Movies’ table the PartitionKey could be the category of the movie, such as comedy or sci-fi, and the RowKey could be the movie title (hopefully the combination of category and title would ensure uniqueness). Under load, the table could then be split onto different servers based on the category.
For a write-intensive scenario such as logging, the PartitionKey would normally be a timestamp. This creates a partitioning problem: writes always append at the bottom of the table, so partitioning based on a range is not efficient because the final partition is always the only active partition. The recommended solution is to add a prefix to the timestamp so that the latest write operations are spread across different partitions.

In database design, tables are normally split based on the type of data. For example, in a retailer’s database, data of the type ‘customer’, with fields such as ‘name’ and ‘address’, would be in a separate table from the ‘orders’ data, which contains fields such as ‘product’ and ‘order_date’. However, in Azure Tables both could be stored efficiently in the same table, since no space is taken up by the empty fields (such as ‘order_date’ for a ‘customer’). To differentiate between the two types of data, a ‘Kind’ property (column) can be added to each entity (row), which is in effect the table name the data would have had if it were separated into two tables.

Table Operations

The operations are relatively similar to those of a conventional database: tables (which are analogous to databases) can be created, queried and deleted, and entities (rows) can be inserted, deleted, queried and updated. There are two methods of update, Merge and Replace. Merge allows a partial update of the entity: if some of the properties of the entity are not supplied with the update, they are left unchanged (only the properties provided in the update are changed). Replace updates all the properties of an entity; if a property is not provided in the update, it is removed from the entity. A newly introduced feature is the Entity Group Transaction, which is a transaction over a single partition.

Continuation Tokens

When a single entity (row) is queried, the result is returned just as with a database query. However, when a range is requested, Azure Tables can return only 1,000 rows in a result set. If the result set contains fewer than 1,000 rows, it is returned in full; if it is larger, the first 1,000 rows are returned together with a Continuation Token. The table is then re-queried with the Continuation Token passed back until the query completes.
Continuation tokens are returned whenever the result set exceeds that 1,000-row limit. They will also be returned if a query takes longer than five seconds (the maximum allowed by Azure, after which the partial results are returned with a continuation token and the query must be rerun). Furthermore, continuation tokens are returned when the end of a partition range boundary is hit.

Optimizing Queries

Querying a table over a range is a very serial process, with result sets being sent to the client and continuation tokens being sent back for processing until the query completes. This structure does not allow for any parallel processing. To take advantage of parallel processing, the query should be split into ranges based on the PartitionKey. For example, instead of

Select * from Movies where Rating > 4

Use

Select * from Movies where PartitionKey >= 'A' and PartitionKey < 'D' and Rating > 4
Select * from Movies where PartitionKey >= 'D' and PartitionKey < 'G' and Rating > 4

This enables the query to run in a parallel manner.

Similar to SQL Server, views can also be created to handle popular queries.

Be careful using ‘OR’ in queries. Azure Tables do not perform any optimization on such queries, so it is usually better to split them into several separate queries.
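Using the same SQL-style notation as above (the real queries would be issued through LINQ or the REST filter syntax), a query such as Select * from Movies where (PartitionKey = 'comedy' or PartitionKey = 'sci-fi') and Rating > 4 would typically be better issued as two separate queries whose results are combined by the client:

Select * from Movies where PartitionKey = 'comedy' and Rating > 4
Select * from Movies where PartitionKey = 'sci-fi' and Rating > 4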

Entity Group Transactions

EGTs offer transaction-like operations over a single partition of an Azure table. Up to 100 insert/update/delete commands can be performed in a single transaction, provided the payload is under 4 MB.

Posted in Azure

Identity Property Range Checking in SQL Server

Posted by Alin D on December 15, 2010

The IDENTITY property for a column of a numerical data type is a frequently used method of achieving system-generated “uniqueness” for each row in a table. Such a column, in turn, is a quite popular choice for the PRIMARY KEY constraint. Most of the time one would choose the data type int for the underlying column. However, the IDENTITY property can be defined on any integer-like data type and even on the decimal data type, as long as the chosen scale is 0. By default SQL Server uses only the positive values unless you specify otherwise, so when you opt to start with a negative seed value, this is perfectly fine for SQL Server, and by doing so you essentially double the range of possible values for most of the available data types. It may hurt one’s aesthetic sense, but if you take negative values into account, this gives you the following ranges of possible values:

tinyint: 0 to 255
smallint: -32,768 to 32,767
int: -2,147,483,648 to 2,147,483,647
bigint: -2^63 to 2^63-1

If you decide to use a decimal data type such as decimal(38, 0), this gives you a range of roughly -10^38 to 10^38-1 possible values, which for almost any practical purpose should be more than enough.
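As a small sketch (the table and column names are purely illustrative), claiming the full int range simply means seeding the IDENTITY at the type’s minimum value:

CREATE TABLE dbo.EventLog (
    EventID int IDENTITY(-2147483648, 1) NOT NULL PRIMARY KEY,
    EventText nvarchar(200) NOT NULL
);

-- The first inserted row gets EventID = -2147483648, leaving
-- the full range of roughly 4.29 billion values before overflow.
INSERT INTO dbo.EventLog (EventText) VALUES (N'first event');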

But what actually happens if you are about to exceed this range?

Let’s create a very simple test case:

CREATE TABLE dbo.id_overflow (
    col1 int IDENTITY(2147483647,1)
);
GO

The above script creates a new table dbo.id_overflow with only one column, col1. This column is of type int with the IDENTITY property defined on it. The seed value is chosen to be the maximum value for the int type, which is 2147483647. I just arbitrarily picked the int data type; I could have chosen any other eligible data type and the result would still be the same. So, when we now insert into this table, the very first insert statement will succeed, while any subsequent one will fail with an arithmetic overflow error.

--This insert will succeed
INSERT INTO dbo.id_overflow DEFAULT VALUES;
--This insert will fail
INSERT INTO dbo.id_overflow DEFAULT VALUES;

(1 row(s) affected)
Msg 8115, Level 16, State 1, Line 2
Arithmetic overflow error converting IDENTITY to data type int.
Arithmetic overflow occurred.

So far, everything is as expected and when we look at the content of the table we only see the one row from the first insert.

SELECT
    *
FROM
    dbo.id_overflow;

col1
2147483647

(1 row(s) affected)

But what do you do in such a case? You can’t insert any more rows into this table. Even if there are gaps in the sequence of the existing IDENTITY values, these gaps won’t be reused automatically. Once a value has been allocated, SQL Server doesn’t care about it any more, and if an insert doesn’t succeed for whatever reason, the freshly allocated value is simply gone.

Essentially, the only feasible solution to this problem is to choose a “bigger” data type. So, a very simplified change script to change the data type in our example to bigint would look like this:

IF OBJECT_ID('dbo.id_overflow') IS NOT NULL
    DROP TABLE dbo.id_overflow;
GO
CREATE TABLE dbo.id_overflow (
    col1 int IDENTITY(2147483647,1)
)
GO

--This insert will succeed
INSERT INTO dbo.id_overflow DEFAULT VALUES;

--Now change the data type to a bigger one.
ALTER TABLE dbo.id_overflow ALTER COLUMN col1 bigint;

--This insert will now succeed as well
INSERT INTO dbo.id_overflow DEFAULT VALUES;

SELECT
    *
FROM
    dbo.id_overflow;

If you run this batch, it will finish without an error and yield the expected result set of two rows. But, as mentioned above, a change script in almost any real-world database would be much more complex: indexes would have to be changed, referencing tables would have to be changed, code where the value of that column is assigned to a variable of type int would have to be changed, and so on.

It is not hard to predict that you’re in deep trouble when this table is one of the main tables in your database and is referenced by many other tables and/or in many places in your code.

I was bitten by a similar scenario not that long ago. Fortunately, it was “only” a lookup table with an IDENTITY column on a smallint-typed column, and I was lucky that I could simply reseed the IDENTITY value, because the last 7,000-plus inserts had failed due to a misunderstanding between the developers of the calling application and me about how a certain parameter to a procedure should be used. But it was still enough trouble for me to decide to write a small check script that is now part of my weekly scripts and that gives me all the tables having such an IDENTITY column, along with the last value consumed as well as the buffer I have left before I run out of values again. Here it is:

-- TypeRange: the minimum and maximum values for each integer-like data type
;WITH TypeRange AS (
SELECT
    'bigint' AS [name],
    9223372036854775807 AS MaxValue,
    -9223372036854775808 AS MinValue
UNION ALL
SELECT
    'int',
    2147483647,
    -2147483648
UNION ALL
SELECT
    'smallint',
    32767,
    -32768
UNION ALL
SELECT
    'tinyint',
    255,
    0
),
-- IdentBuffer: identity columns with their last consumed value and remaining buffer
IdentBuffer AS (
SELECT
    OBJECT_SCHEMA_NAME(IC.object_id) AS [schema_name],
    O.name AS table_name,
    IC.name AS column_name,
    T.name AS data_typ,
    CAST(IC.seed_value AS decimal(38, 0)) AS seed_value,
    IC.increment_value,
    CAST(IC.last_value AS decimal(38, 0)) AS last_value,
    CAST(TR.MaxValue AS decimal(38, 0)) -
        CAST(ISNULL(IC.last_value, 0) AS decimal(38, 0)) AS [buffer],
    CAST(CASE
            WHEN seed_value < 0
            THEN TR.MaxValue - TR.MinValue
            ELSE TR.maxValue
        END AS decimal(38, 0)) AS full_type_range,
    TR.MaxValue AS max_type_value
FROM
    sys.identity_columns IC
    JOIN
    sys.types T ON IC.system_type_id = T.system_type_id
    JOIN
    sys.objects O ON IC.object_id = O.object_id
    JOIN
    TypeRange TR ON T.name = TR.name
WHERE
    O.is_ms_shipped = 0)

SELECT
    IdentBuffer.[schema_name],
    IdentBuffer.table_name,
    IdentBuffer.column_name,
    IdentBuffer.data_typ,
    IdentBuffer.seed_value,
    IdentBuffer.increment_value,
    IdentBuffer.last_value,
    IdentBuffer.max_type_value,
    IdentBuffer.full_type_range,
    IdentBuffer.buffer,
    CASE
        WHEN IdentBuffer.seed_value < 0
        THEN (-1 * IdentBuffer.seed_value +
          IdentBuffer.last_value) / IdentBuffer.full_type_range
        ELSE (IdentBuffer.last_value * 1.0) / IdentBuffer.full_type_range
    END AS [identityvalue_consumption_in_percent]
FROM
    IdentBuffer
ORDER BY
    [identityvalue_consumption_in_percent] DESC;

Since SQL Server 2005 it has been really easy to get this information. As you can see from the script, I have omitted the decimal(38,0) alternative; for me, a bigint column with a negative seed value is more than I would ever possibly need. I have gotten into the habit of running this check regularly to monitor how many values are left in the buffer before something blows up again and to get a feeling for how urgent it is to tackle the inevitable changes to the database. A possible variation would be to send out an alert when a certain threshold is reached, but that I leave up to your imagination.
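For the lookup-table scenario mentioned above, where the failed inserts had burned through values that were never actually used, a reseed bought time before any schema change was needed. A minimal sketch, assuming a hypothetical dbo.LookupTable whose highest value actually in use is known to be 1000:

-- Check the current identity value without changing it
DBCC CHECKIDENT ('dbo.LookupTable', NORESEED);

-- Reset the identity so that the next insert uses 1001
-- (only safe if no higher values are actually in use)
DBCC CHECKIDENT ('dbo.LookupTable', RESEED, 1000);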

Posted in SQL