Discussion:
sp_primarykey vs. unique clustered index
(too old to reply)
1dg618
2011-10-19 00:44:00 UTC
Permalink
If I have the following:

CREATE TABLE ir_sample_table
(
first_id INT NOT NULL,
second_id INT NOT NULL,
thrid_id INT NOT NULL,
some_date DATETIME NOT NULL,
junk VARCHAR(25) NULL,
junk_2 VARCHAR(25) NULL
)
LOCK datapages ON 'default' partition BY roundrobin 1
GO

EXEC sp_primarykey 'ir_sample_table', 'first_id', 'second_id',
'third_id', 'some_date'
GO

CREATE UNIQUE CLUSTERED INDEX ir_sample_table_uc_idx_1 ON
ir_ir_sample_table
(
some_date,
first_id,
second_id,
thrid_id
)
GO

I thought sp_primarykey could create a unique clustered index? But, I
can run both the commands and they work.

If I use sp_primarykey then it's just creating a constraint like a
foreign key enforcing referential integrity? It's not creating a
unique clustered index?
b***@msn.com
2011-10-19 15:53:16 UTC
Permalink
Post by 1dg618
CREATE TABLE ir_sample_table
(
        first_id         INT               NOT NULL,
        second_id        INT               NOT NULL,
        thrid_id         INT               NOT NULL,
        some_date        DATETIME          NOT NULL,
        junk             VARCHAR(25)           NULL,
        junk_2           VARCHAR(25)           NULL
)
LOCK datapages ON 'default' partition BY roundrobin 1
GO
EXEC sp_primarykey 'ir_sample_table', 'first_id', 'second_id',
'third_id', 'some_date'
GO
CREATE UNIQUE CLUSTERED INDEX ir_sample_table_uc_idx_1 ON
ir_ir_sample_table
(
  some_date,
  first_id,
  second_id,
  thrid_id
)
GO
I thought sp_primarykey could create a unique clustered index? But, I
can run both the commands and they work.
If I use sp_primarykey then it's just creating a constraint like a
foreign key enforcing referential integrity? It's not creating a
unique clustered index?
sp_primarykey does not create any index or constraint. It just
creates a
form of documentation in the system catalogs that clients can
reference
to understand the relationship of objects in the database.

See the first and second bullets under "usage"
http://infocenter.sybase.com/help/topic/com.sybase.infocenter.dc36273.1570/html/sprocs/X43109.htm

My impression is that many people don't bother using sp_primarykey
(and related sp_commonkey and sp_foreignkey) and simply rely on
the existence of actual unique indexes and constraints to be
self-documenting.
Derek Asirvadem
2011-10-19 19:01:10 UTC
Permalink
I thought sp_primarykey [did] create a unique clustered index?
No
But, I can run both the commands and they work.
If I use sp_primarykey then it's just creating a constraint like a
foreign key enforcing referential integrity?
No.
It's not creating a unique clustered index?
No.

sp_primarykey, sp_commonkey and sp_foreignkey are relics from the old
days, before SQL had the syntax to implement indices via PRIMARY KEY,
UNIQUE and FOREIGN KEY CONSTRAINTS. They were just declarations that
were placed in the catalogue, which report tools etc could pick up and
use. Nowadays, they read the CONSTRAINTS from the catalogue, and the
declarations are not relevant. They create no indices or CONSTRAINTS.

UNIQUE [CLUSTERED] and PRIMARY KEY [CLUSTERED] CONSTRAINTS do create
indices.

CREATE [UNIQUE] [CLUSTERED] INDEX creates an index.

You need to be careful not to accidentally create duplicate indices,
either by mixing CONSTRAINTS with CREATE INDEX statements, or by
CREATE INDEX with the same column group, but with another index_name.

Regards
Derek
1dg618
2011-10-19 20:27:24 UTC
Permalink
Post by Derek Asirvadem
I thought sp_primarykey [did] create a unique clustered index?
No
But, I can run both the commands and they work.
If I use sp_primarykey then it's just creating a constraint like a
foreign key enforcing referential integrity?
No.
It's not creating a unique clustered index?
No.
sp_primarykey, sp_commonkey and sp_foreignkey are relics from the old
days, before SQL had the syntax to implement indices via PRIMARY KEY,
UNIQUE and FOREIGN KEY CONSTRAINTS. They were just declarations that
were placed in the catalogue, which report tools etc could pick up and
use.  Nowadays, they read the CONSTRAINTS from the catalogue, and the
declarations are not relevant.  They create no indices or CONSTRAINTS.
UNIQUE [CLUSTERED] and PRIMARY KEY [CLUSTERED] CONSTRAINTS do create
indices.
CREATE [UNIQUE] [CLUSTERED] INDEX creates an index.
You need to be careful not to accidentally create duplicate indices,
either by mixing CONSTRAINTS with CREATE INDEX statements, or by
CREATE INDEX with the same column group, but with another index_name.
Regards
Derek
Thank you.
1dg618
2011-10-19 21:09:26 UTC
Permalink
Post by Derek Asirvadem
I thought sp_primarykey [did] create a unique clustered index?
No
But, I can run both the commands and they work.
If I use sp_primarykey then it's just creating a constraint like a
foreign key enforcing referential integrity?
No.
It's not creating a unique clustered index?
No.
sp_primarykey, sp_commonkey and sp_foreignkey are relics from the old
days, before SQL had the syntax to implement indices via PRIMARY KEY,
UNIQUE and FOREIGN KEY CONSTRAINTS. They were just declarations that
were placed in the catalogue, which report tools etc could pick up and
use.  Nowadays, they read the CONSTRAINTS from the catalogue, and the
declarations are not relevant.  They create no indices or CONSTRAINTS.
UNIQUE [CLUSTERED] and PRIMARY KEY [CLUSTERED] CONSTRAINTS do create
indices.
CREATE [UNIQUE] [CLUSTERED] INDEX creates an index.
You need to be careful not to accidentally create duplicate indices,
either by mixing CONSTRAINTS with CREATE INDEX statements, or by
CREATE INDEX with the same column group, but with another index_name.
Regards
Derek
Date
Author
Book
isin
junk1
junk2
junk3

primary key clustered index: Date, Author, Book

A clustered index is a good idea for example when a date is involved?
For example, if I have 29 million records going back to 1980 and I
want to find a book from 1997 by a particular author a clustered index
with the date first will quickly get to the 1997 records then quickly
to the author then it might have to scan for the book? I think, the
primary key clustered index will find the record pretty quick?

Now, if I want to find the isin and I know a date range the primary
key clustered index will quickly narrow the records because of the
date but may have to scan for the isin in that date range? Is there a
way to prevent a table scan for the isin? or a way to quickly find the
isin? Would a nonclustered index on isin help a query?

In plans, I see queries doing forward scans and table scans. I assume,
a forward scan uses an index to drill down to the records then table
scans for the records?

I know I'm asking a lot of questions, but this is pretty interesting
stuff. There's a ton of informaton out there, but it's not really
straightforward.
b***@msn.com
2011-10-20 18:52:32 UTC
Permalink
Post by 1dg618
Post by Derek Asirvadem
I thought sp_primarykey [did] create a unique clustered index?
No
But, I can run both the commands and they work.
If I use sp_primarykey then it's just creating a constraint like a
foreign key enforcing referential integrity?
No.
It's not creating a unique clustered index?
No.
sp_primarykey, sp_commonkey and sp_foreignkey are relics from the old
days, before SQL had the syntax to implement indices via PRIMARY KEY,
UNIQUE and FOREIGN KEY CONSTRAINTS. They were just declarations that
were placed in the catalogue, which report tools etc could pick up and
use.  Nowadays, they read the CONSTRAINTS from the catalogue, and the
declarations are not relevant.  They create no indices or CONSTRAINTS.
UNIQUE [CLUSTERED] and PRIMARY KEY [CLUSTERED] CONSTRAINTS do create
indices.
CREATE [UNIQUE] [CLUSTERED] INDEX creates an index.
You need to be careful not to accidentally create duplicate indices,
either by mixing CONSTRAINTS with CREATE INDEX statements, or by
CREATE INDEX with the same column group, but with another index_name.
Regards
Derek
Date
Author
Book
isin
junk1
junk2
junk3
primary key clustered index: Date, Author, Book
A clustered index is a good idea for example when a date is involved?
For example, if I have 29 million records going back to 1980 and I
want to find a book from 1997 by a particular author a clustered index
with the date first will quickly get to the 1997 records then quickly
to the author then it might have to scan for the book? I think, the
primary key clustered index will find the record pretty quick?
Now, if I want to find the isin and I know a date range the primary
key clustered index will quickly narrow the records because of the
date but may have to scan for the isin in that date range? Is there a
way to prevent a table scan for the isin? or a way to quickly find the
isin? Would a nonclustered index on isin help a query?
In plans, I see queries doing forward scans and table scans. I assume,
a forward scan uses an index to drill down to the records then table
scans for the records?
I know I'm asking a lot of questions, but this is pretty interesting
stuff. There's a ton of informaton out there, but it's not really
straightforward.
A clustered index is a good idea when you will have a lot of
queries involving a range of data ( <, >, between x and y)

To find a book by a particular author published in a particular year,
an index on author is likely to be far better. Most authors publish
very few book in their lifetime, but large numbers of books are
published each year.

If all the columns needed to find the isin are keys of an index,
ASE can get the isin from the leaf level of the index and avoid
reading any of the actual data pages. This is called a "covered
index scan"

forward and backward refer just to the direction of the scan.
Both table scans and index scans can be done forward or
backward. Backward scans are unusual, usually done due
to an ORDER BY ... DESC. The backward scan returns
values in descending order, so eliminates the need to
sort the results in a worktable. However, backward scans
increase the odds of deadlocks.
1dg618
2011-10-20 22:59:39 UTC
Permalink
Post by b***@msn.com
Post by 1dg618
Post by Derek Asirvadem
I thought sp_primarykey [did] create a unique clustered index?
No
But, I can run both the commands and they work.
If I use sp_primarykey then it's just creating a constraint like a
foreign key enforcing referential integrity?
No.
It's not creating a unique clustered index?
No.
sp_primarykey, sp_commonkey and sp_foreignkey are relics from the old
days, before SQL had the syntax to implement indices via PRIMARY KEY,
UNIQUE and FOREIGN KEY CONSTRAINTS. They were just declarations that
were placed in the catalogue, which report tools etc could pick up and
use.  Nowadays, they read the CONSTRAINTS from the catalogue, and the
declarations are not relevant.  They create no indices or CONSTRAINTS.
UNIQUE [CLUSTERED] and PRIMARY KEY [CLUSTERED] CONSTRAINTS do create
indices.
CREATE [UNIQUE] [CLUSTERED] INDEX creates an index.
You need to be careful not to accidentally create duplicate indices,
either by mixing CONSTRAINTS with CREATE INDEX statements, or by
CREATE INDEX with the same column group, but with another index_name.
Regards
Derek
Date
Author
Book
isin
junk1
junk2
junk3
primary key clustered index: Date, Author, Book
A clustered index is a good idea for example when a date is involved?
For example, if I have 29 million records going back to 1980 and I
want to find a book from 1997 by a particular author a clustered index
with the date first will quickly get to the 1997 records then quickly
to the author then it might have to scan for the book? I think, the
primary key clustered index will find the record pretty quick?
Now, if I want to find the isin and I know a date range the primary
key clustered index will quickly narrow the records because of the
date but may have to scan for the isin in that date range? Is there a
way to prevent a table scan for the isin? or a way to quickly find the
isin? Would a nonclustered index on isin help a query?
In plans, I see queries doing forward scans and table scans. I assume,
a forward scan uses an index to drill down to the records then table
scans for the records?
I know I'm asking a lot of questions, but this is pretty interesting
stuff. There's a ton of informaton out there, but it's not really
straightforward.
A clustered index is a good idea when you will have a lot of
queries involving a range of data  ( <, >, between x and y)
To find a book by a particular author published in a particular year,
an index on author is likely to be far better.  Most authors publish
very few book in their lifetime, but large numbers of books are
published each year.
If all the columns needed to find the isin are keys of an index,
ASE can get the isin from the leaf level of the index and avoid
reading any of the actual data pages.  This is called a "covered
index scan"
forward and backward refer just to the direction of the scan.
Both table scans and index scans can be done forward or
backward.  Backward scans are unusual, usually done due
to an ORDER BY ... DESC.  The backward scan returns
values in descending order, so eliminates the need to
sort the results in a worktable.  However, backward scans
increase the odds of deadlocks.- Hide quoted text -
- Show quoted text -
Makes sense. The primary key clustered index enforces the primary key.
I guess, a unique clustered index or a unique non-clustered index
would enforce the primary key too? I always put the primary key
clustered index on the primary key and ordered the columns from
distinct to least distinct... if that makes sense (I don't know
another way to say it?). For example, if my primary key had a date in
it, and users would do something like WHERE date <= ? and date >= ? it
makes sense to have the date first in the index followed by the rest
of the columns. The idea is to drill down into the records? On the
other hand it all depends. In the example above, the author would be
first in the index then the date.

Once one decides on the primary key, how should the columns in a
clustered index be ordered?

How should the columns in a non-clustered index be ordered?

Does the order of the columns in the WHERE clause matter? If my unique
clustered index is the following: date, first_id, and second_id should
my where clause be WHERE date = ? and first_id = ? and second_id = ?
or it doesn't matter? If I have WHERE first_id = ? and second_id = ?
and date = ? the optimizer will figure it out?
Derek Asirvadem
2011-10-21 03:38:25 UTC
Permalink
On Oct 21, 9:59 am, 1dg618 <***@gmail.com> wrote:

I have to say, you have good, specific questions that have excellent
application across the many environments.
Post by 1dg618
Makes sense. The primary key clustered index enforces the primary key.
I guess, a unique clustered index or a unique non-clustered index
would enforce the primary key too?
I always put the primary key
clustered index on the primary key and ordered the columns from
distinct to least distinct... if that makes sense (I don't know
another way to say it?).
I understand. You mean most or least no-of-distinct-values.
Post by 1dg618
For example, if my primary key had a date in
it, and users would do something like WHERE date <= ? and date >= ? it
makes sense to have the date first in the index followed by the rest
of the columns.
Yes. As detailed above, that is useful only if ? is a distinct date,
and not useful when ? is a year or year+month.
Post by 1dg618
The idea is to drill down into the records? On the
other hand it all depends. In the example above, the author would be
first in the index then the date.
We have crossed posts, I have provided further detail above, but we
are on the same wavelength. If any question remains, please ask.

The drill-down is something you do, interactively, the Optimiser does
not do that, it chooses an index, if it deems it useful.
Post by 1dg618
Once one decides on the primary key, how should the columns  in a
clustered index be ordered?
Note CIs exist, and Range Queries are supported by them, for APL only,
your question and the answer applies to APL only.

Answer 1. Generally, if no other criteria were used in the decision,
in the order you suggest, most distinct column first. That provides
the best data distribution, lest scope of interference, etc.

Answer 2. But life isn't that simple. We need simple rules, but we
have to be able to implement them with the full context in mind. We
cannot make simple decisions on table-by-table basis. Given that you
are using Relational principles, the table exists in the context of a
Relational database, it is not isolated and alone; it has Relations;
if it has an Identifying parent Relation, it will inherit an FK, which
demands to be used as the *first part* of its own PK. The table must
be considered in its full context.

Refer this example:
____http://www.softwaregems.com.au/Documents/Documentary%20Examples/
Order%20DM%20Advanced.pdf

If we were considering OrderSale, CustomerId is an inherited FK; it
would be silly to add a surrogate OrderSaleId; we achieve a very nice
reduced scope of interference (preventing two Customers' Order
searches/updates/inserts from colliding) by distributing the data, by
choosing CustomerId as the first column; now we don't need a
surrogate, only a sequence no within CustomerId, named OrderSaleNo, to
form an unique OrderSale identifier.

Likewise, when considering OrderSaleItem, (CustomerId, OrderSaleNo) is
the inherited FK. We may as well make use of it. If we use it to form
the PK, we need only to add a column that makes a line item unique.
No sense in adding a LineItemNo, PartId does that nicely.

Now in OrderSaleItem, We can consider your question more thoroughly.
Other considerations above have narrowed the options down to two
remaining: (CustomerId, OrderSaleNo, PartId) or (PartId, CustomerId,
OrderSaleNo) for the CI. Let's say we have tens of thousands of Parts
and thousands of Customers. From a purely distribution perspective
(which is isolated) the latter would be better. From any other
perspective, and certainly from taking the database as an integrated
and related set of tables, the former is better. For range queries,
that look up (CustomerId) or (CustomerId, OrderSaleNo) the former is
an outright winner. I have more of them than I have that look up
(PartId). Every lookup for an OrderSale (ie. paint the screen) and
its line items is a Range Query in OrderSaleItem.

In record filing systems that are placed in a container labelled
"database", the PKs are all Id[iot] columns or surrogate keys; they
are not "keys" in the relational sense, they are row identifiers. It
is an anachronism, they have regressed to the pre-Relational
paradigm. The "database" is robbed of its Relational power. They
have far more indices than in a normalised Relational database (but
they are all single-column identifiers). In such "databases", your
question does not apply. Range queries are not possible. Covered
queries are simply not hosted; only the most obscure are possible.
Post by 1dg618
How should the columns in a non-clustered index be ordered?
Since the considerations for CIs do not apply, the importance is far
less.

The main consideration is, the search arguments given to queries *that
are not range queries*. The second is uniqueness 9as detailed above).

OrderSaleItem is a good example to consider. (In IDEF1X standard an
Unique Index which is not PK is shown as "AK[i].[c]" where i the
index_no and [c] is column order.) Now the only other frequent access
path I need on that table is by PartId. Sure, the naïve modeller will
create an NCI (PartId), and think of nothing further.

But any cursory examination, any inspection of Plans, we reveal that
such an NCI is not being used, the udder in its present location is
useless. The trick is to make it useful, make it unique; so I create
(PartId, CustomerId, OrderSaleNo), knowing full well that my *queries*
only supply PartId and are seeking OrderSaleItem where they are used.
I do not have to perform a P&T assignment to determine that the
density of NCI (PartId) is 0.0 and thus useless to the Optimiser; that
(PartId, CustomerId, OrderSaleNo) is 1.0. And no surprise, my plans
suddenly start using that index.

The NCI order (PartId, OrderSaleNo, CustomerId) will not harm much, it
is just contrary to sensibility. I avoid such silliness, so I cannot
say for certan what the cost of it is. I don't know from assignments,
because I generally replace the entire indexing paradigm en masse, I
do not examine the effect of minor errors within major errors.
Post by 1dg618
Does the order of the columns in the WHERE clause matter? If my unique
clustered index is the following: date, first_id, and second_id should
my where clause be WHERE date = ? and first_id = ? and second_id = ?
or it doesn't matter? If I have WHERE first_id = ? and second_id = ?
and date = ? the optimizer will figure it out?
Yes. The order does not matter at all.

The quality of the value of each operand in the WHERE clause matters
very much: if it is a distinct value vs partial or wildcard; the ratio
of the value vs the density of the column; operators such as greater-
than vs equal; datatype mismatches; the use of IN demands a worktable;
etc, to name a few.

Regards
Derek
Derek Asirvadem
2011-10-21 07:07:59 UTC
Permalink
1dg618

It may not be obvious, click on objects you that you are interested
in, they should open, to provide further detail.

Regards
Derek
1dg618
2011-10-21 15:19:50 UTC
Permalink
How does APL, DPL, or DRL factor into the above? I'd make my table
APL?

CI (CustomerId, OrderSaleNo, PartId) --> the records would be in a
sorted order to keep it simple
NCI (PartId, CustomerId, OrderSaleNo) --> the records would be in
some order most likely using the clustered index?

so the foolish person would make the NCI (PartId)? --> there would be
no order therefore the optimizer would avoid using the index? Can you
explain why this would be bad. I think, most people would do it? I
think, in the past, I have done it too thinking it would help
performance, but nobody could explain why it's bad.

If I had NCI (PartId, CustomerId), it would be better than NCI
(PartId)... the optimizer may not use it, but it's a step towards
uniqueness.

One shouldn't worry about the overhead or do the two play well
together?

I'm starting to see the light.

I'm coming from the relational algebra/calculus side. Database design
seems to be drifting away from relationships into some fuzzy
relationship model, which isn't a relationship at all. I know it's
hard for database vendors to adhere to relational algebra because they
have to implement it meaning it's easier on paper than developing a
database server.

I always thought if your data is truly relational, you'd have amazing
performance?
Derek Asirvadem
2011-10-22 02:17:41 UTC
Permalink
Post by 1dg618
How does APL, DPL, or DRL factor into the above? I'd make my table
APL?
If you study the links I provided, you will get the background &
detail, the reasoning. Otherwise I have to re-type everything here.
After all the arguments are had, and the smoke clears, it boils down
to one thing:

APL tables are much more suited for Relational databases, range
queries, composite keys.

DPL/DRL tables are much more suited for record filing systems, no
range queries, surrogate keys.
Post by 1dg618
CI    (CustomerId, OrderSaleNo, PartId) --> the records would be in a
sorted order to keep it simple
Yes
Post by 1dg618
NCI  (PartId, CustomerId, OrderSaleNo) --> the records would be in
some order most likely using the clustered index?
Not sure what you mean. If the table has a clustered index (eg.
immediately above, then that is the explicit order of the rows. The
NCI does not change that. The NCI is a B-Tree. All *index entries*
are in the order of the index; here it is (PartId, CustomerId,
OrderSaleNo). The leaf level entries in the NCI hold the CI key; that
is what they use to find the row in the CI. This allows the CI to be
keep trim: when rows are deleted, they are moved on the page, and the
page is trim (no holes); the pages in the extents are kept trim,
etc. But these movements do not affect the NCIs, because they hold
the CI key, not a RowId.

So the rows (living in the CI) are maintained in CI order; the NCI is
in NCI, and containing the CI key.
Post by 1dg618
so the foolish person would make the NCI (PartId)? --> there would be
no order therefore the optimizer would avoid using the index?
Read again, I did not say that (PartId) has no order, it has, the
PartId. I said that that is not unique. Tens of thousands of unique
PartIds distributed across millions of OrderSaleItems. The uniqueness
is expressed as a density ratio; it is low, 0.0007. An NCI with
(PartId) alone; say 20,000 across 29,000,000 is not very useful. That
is what nebie (as opposed to foolish) DBAs would do "I am queries on
OrderSaleitem thatfrequently use PartId, so I will add an
NCI(PartId). Then they will spend years trying to figure out why the
NCI does not get used for queries that use (PartId).

But the seasoned DBA who has been through the hoops a few times, will
know that an NCI gets used, the Optimiser finds it useful, when the
NCI is unique, with a density of close to 1.0. So he will add columns
to the NCI to make it unique, knowing that he and the optimiser are
actually interested in just the first column (PartId,...). He will
make the NCI (PartId, CustomerId, OrderSaleNo). Those three cols, are
in the CI which is unique, he has changed the order for a set of
columns that are known to be unique.
Post by 1dg618
Can you
explain why this would be bad. I think, most people would do it? I
think, in the past, I have done it too thinking it would help
performance, but nobody could explain why it's bad.
See above. It is a common problem, more of a lack of knowledge than an
outright mistake. it is one thing that I routinely correct on every
P&T assignment, it guaranteed to get happy campers.
Post by 1dg618
If I had NCI  (PartId, CustomerId), it would be better than NCI
(PartId)... the optimizer may not use it, but it's a step towards
uniqueness.
Yes, exactly.

And NCI (PartId, CustomerId, OrderSaleNo), is definitely unique, and
guaranteed to be used. In this case, let's say the CI was used for
80% of the queries, then 20% would be table scans when there is:
- No NCI
- NCI (PartId)
- NCI (PartId, CustomerId)

That 20% table scans will change to say 18% index scans when there is:
- NCI (PartId, CustomerId, OrderSaleNo)
Post by 1dg618
One shouldn't worry about the overhead or do the two play well
together?
Not sure which OH you are talking about. The NCI is demanded, so the
consideration is between a single-column NCI and a 3-column NCI. The
overhead of an index is considerable. Once you have the index, the
difference between 1 and 3 columns is tiny (except when the columns
are fat/wide/nullable).

The two (CI and useful NCI) go hand in hand, left and right.

Just for your to consider the equivalent in a record filing system,
surrogate keys:
- the surrogate is always an additional column, an additional index,
say (OrderSaleItemId), and they always make that the Placement Index.
Meaningless, because it is the chronological order, which is what the
Heap has anyway.
- they get no data distribution
- REORG REBUILD is
- They need NCI_2(CustomerId) for most queries
- they still need NCI_3(OrderSaleNo),
- and NCI_4(PartId)
- a total of four NCIs,
- no clustering, no range queries
- the slightly faster inserts/deletes have to be balanced against the
demanded REORG REBUILD maintenance (basically offline), and additional
indices
- Plans will show that NCI(OrderSaleNo) is the least used, and they
can drop it
Post by 1dg618
I'm starting to see the light.
I'm coming from the relational algebra/calculus side. Database design
seems to be drifting away from relationships into some fuzzy
relationship model, which isn't a relationship at all.
Yes, IT has deteriorated badly, and we now have florists and
gyprockers doing IT work, with no qualifications. The better ones
think they have invented something new, and publish papers about it;
the rest think, wow, and use it. None of them realise that there is
an art to it, and that art is founded in an exact science.

There is no name for thi s"model", it is just un-normalised, often
fraudulently called de-normalised "for performance". Hilarious. De-
normalised implies that (a) Normalisation has been completed, (b)
correctly, and then (c) backed off. These record filing system never
had the normalisation, to be subsequently de-normalised.
Post by 1dg618
I know it's
hard for database vendors to adhere to relational algebra because they
have to implement it meaning it's easier on paper than developing a
database server.
Not really. SQL as a language is cumbersome and restricting, and the
progress is very slow. But the big vendors all have Extensions, and
they progress quite fast. DB2 *& Sybase have done a lot to improve
their optimisers, so that even bad queries get optimised well. They
have implemented what I would categorise as Extensions to Codd &
Date's Relational Model.

What we need is a language that accepts relational algebra as source
code, perhaps to sit on top of SQL, since that is the established
base, and it works through all the levels of the system, eliminating
the need to write drivers, etc.

So even without a successor to the RM, and without a modern Rdb
language, the Rdb space has progressed quite a lot.

The biggest problem is not either of those. It is that, in spite of
all that databases can be, the majority of databases implemented are
these tiny isolated record filing systems, that use about 5% of the
Relational capability and about 7% of SQL and the vendors extensions.
Eg. you find very few true corporate db models, they are container-
ised and split into 42 parts. Then the silly buggers create
additional DWs, the import/export utilities, replicate everything.
The technology has progressed but the implementations are still
pre-1984.
Post by 1dg618
I always thought if your data is truly relational, you'd have amazing
performance?
Abso-freaking-lutely. I guarantee it in my contracts. I regularly
get 10-12 times (1000 to 1200%) overall improvement in speed (without
changing the box, the o/s, the Sybase version). I get 100 to 1,000s
of times improvement on specific queries, due to correcting the
indexing.

Truly Relational implies Normalised, however, for the reasons stated
above, let me make that explicit. I use 5NF minimum, also identified
in the contract. That means no "update anomalies", zero data
duplication. I do not allow nulls in the Rdb. I use 6NF as needed, to
provide even more speed, functions and capability. of course
normalisation means rewriting the daatabase and thsu rewriting the app
and reports. Now that more people are producing garbage, and more
people are realising the value of normalisation, this type of
assignment is actually increasing.

Regards
Derek
1dg618
2011-10-22 23:44:39 UTC
Permalink
What is the difference between the CONSTRAINT / PRIMARY KEY CLUSTERED
in the CREATE TABLE...

CREATE TABLE my_table
(
one_id INT NOT NULL,
two_id INT NOT NULL,
date DATETIME NOT NULL,
source VARCHAR(8) NOT NULL,
CONSTRAINT my_table_pkc_1 PRIMARY KEY CLUSTERED ( date, one_id,
two_id, source ) on 'default'
)
LOCK allpages ON 'default' partition BY roundrobin 1
GO

and creating the UNIQUE CLUSTERED INDEX seperately?

CREATE UNIQUE CLUSTERED INDEX my_table_2 ON my_table
(
date,
one_id,
two_id,
source
)
GO

Are there performance hits?

Would adding a non-clustered index build off of the clustered index?

CREATE NONCLUSTERED INDEX my_table_3 ON my_table
(
one_id,
date,
two_id,
source
)
GO

Does the CLUSTERED index build the foundation for non-clustered
indexes?

I'm using APL.

I created the table and I'm loading data to test it. It seems to be
taking a long time.
1dg618
2011-10-26 12:55:14 UTC
Permalink
(1) - Clustered Index: date, first_id, second_id

(2) - Clustered Index: first_id, date, second_id

(3) - Clustered Index: second_id, first_id, date

date: low cardinality

first_id: med. cardinality

second_id: high cardinality

~700 million records in the table

Oracle doesn't care about the order in the unique clustered index.

Sybase seems to care about the order in the unique clustered index.
I'll report more about Sybase after today. My table of about 700
million rows is allpages lock, (constraint) primary key clustered
index (date, first_id, second_id).

Pretty darn quick no date range

Pretty darn quick when I have a date range in my where clause (e.g.
date >= ? and date <= ?)

My other table containing the same data is datapages lock and a unique
clustered index on first_id, second_id, date.

pretty darn quick no date range

Pretty darn slow compared to the other table when I have a date range
in my where clause (e.g. date >= ? and date <= ?)

I loaded the table where the date range query is pretty slow in sorted
order.

I'll try to figure it out today, unless someone knows?
Antony
2011-10-26 21:15:06 UTC
Permalink
I think you need to give us the exact SQL you are using, as it doesn't seem obvious (at least to me) what you're really comparing.
Also, can you tell us when the DOL table was last reorg'd? In a DOL table, a clustered index isn't truly clustered, and the efficiency of a range query will decrease over time as rows get deleted/forwarded etc.
Derek Asirvadem
2011-10-29 05:54:25 UTC
Permalink
Post by 1dg618
What is the difference between the CONSTRAINT / PRIMARY KEY CLUSTERED
in the CREATE TABLE...
CREATE TABLE my_table
(
        one_id         INT               NOT NULL,
        two_id         INT               NOT NULL,
        date           DATETIME          NOT NULL,
        source         VARCHAR(8)        NOT NULL,
        CONSTRAINT UC_PK PRIMARY KEY CLUSTERED ( date, one_id,
two_id, source ) on 'default'
)
LOCK allpages ON 'default' partition BY roundrobin 1
That is the ANSI/ISO/IEC syntax. It results in an index being created,
which is the same as:
* NOT NULL on all columns
CREATE CLUSTERED INDEX on my_table
ON (date, one_id, two_id, source ) on 'default'
Post by 1dg618
and creating the UNIQUE CLUSTERED INDEX seperately?
CREATE UNIQUE CLUSTERED INDEX UC_PK ON my_table
(
  date,
  one_id,
  two_id,
  source
)
Nothing. (I have corrected your index names for clarity)
Post by 1dg618
Are there performance hits?
No, because there is no difference.
Post by 1dg618
Would adding a non-clustered index build off of the clustered index?
I am not sure what you mean by "build off". The CI Key (your four
columns) forms the leaf level of the CI, the B-Tree is tiny. Rather
than keep the index physically separate from the data, the Sybase AP
CI CLUSTERS the index leaf-level with the data ROW; the heap is
eliminated. So you get a good, sparse, distributed index and data all-
in-one. It is kept very trim, as is, the rows within pages; pages
with extents, etc. Obviously the rows are in physical order by key.

Any NCI for such a CI, contains the CI Key as the leaf-level to the CI-
data. This allows the rows to be moved without having to update the
NCIs.

Whereas in a DPL/DRL table, the heap always remains, and the NCIs are
always physically separate structures. The leaf-level in those NCIs
are a ROW_ID, not a CI key (they cannot have CIs)
Post by 1dg618
CREATE NONCLUSTERED INDEX NC_1 ON my_table
(
  one_id,
  date,
  two_id,
  source
)
Does the CLUSTERED index build the foundation for non-clustered
indexes?
I think my expalanation above should answer that. If what you call
"foundation", is a single, CI structure, containing Index+data, very
fast, and sparse, then Yes.
Post by 1dg618
I created the table and I'm loading data to test it. It seems to be
taking a long time.
1. Well, use one of the faster techniques. SELECT-INTO, vectors, BCP,
etc.

2. "A long time" is a relative term, what are you comparing it with,
and identical DPL/DRL table ? Traffic at 5:00 pm ? at 2:00 am ?
Post by 1dg618
(1) - Clustered Index: date, first_id, second_id
(2) - Clustered Index: first_id, date, second_id
(3) - Clustered Index: second_id, first_id, date
date: low cardinality
first_id: med. cardinality
second_id: high cardinality
~700 million records in the table
Oracle doesn't care about the order in the unique clustered index.
That's because the Oracle clustered index is completely different to
the Sybase CI. The Oracle IOT is more like the Sybase CI.
Post by 1dg618
Sybase seems to care about the order in the unique clustered index.
Yes, I believe I already explained that.

Sybase actually does not care. But the order will make a difference
based on the query that you give Sybase to process.
Post by 1dg618
I'll report more about Sybase after today. My table of about 700
million rows is allpages lock, (constraint) primary key clustered
index (date, first_id, second_id).
SO I am not sure what your question is, it appears you have (1), but
you would like to know abut (2) and (3). Or have you been trying (2)
and (3) on your table.

If you have many interspersed inserts, do not forget to drop/create
the CI.
Post by 1dg618
Pretty darn quick no date range
Guaranteed, as per my explanation.
Post by 1dg618
Pretty darn quick when I have a date range in my where clause (e.g.
date >= ? and date <= ?)
That is a Range Query, yes.
Post by 1dg618
My other table containing the same data is datapages lock and a unique
clustered index on (first_id, second_id, date).
Ain't no such thing, as explained in detail. The syntax required may
confuse you into thinking you have a "clustered index". You don't.
It is a plain old NCI.
Post by 1dg618
pretty darn quick no date range
Don't know what the sargs are, but, Ok.
Post by 1dg618
Pretty darn slow compared to the other table when I have a date range
in my where clause (e.g. date >= ? and date <= ?)
Guaranteed, as well. As I have detailed in my explanation, DPL/DRL
tables cannot support range queries the way APL/CI can.

SET STATISTICS IO ON, and get an idea of hwat it is doing.

Also, read this thread again, please.

Also, read Part A of the second linked doc above.
Post by 1dg618
I loaded the table where the date range query is pretty slow in sorted
order.
It is getting a bit confusing. Perhaps you can give the tables stable
names, so that we know which one you mean. I presume you mean the DPL
table, with ~700 million rows. You may need to clear that up, build
the Placement index without SORTED_ORDER.
Post by 1dg618
I'll try to figure it out today, unless someone knows?
Figure what out ? The question is not stated.

Regards
Derek

Derek Asirvadem
2011-10-21 02:00:08 UTC
Permalink
On Oct 20, 8:09 am, 1dg618 <***@gmail.com> wrote:

You are asking a very important question which relates to many aspects
of table/index design, and the resulting performance of queries, so
let me be clear.

SomeTable (
Post by 1dg618
Date
Author
Book -- Title
Sex
ISIN
Post by 1dg618
junk1
junk2
junk3
PRIMARY KEY CLUSTERED (Date, Author, Book)
)
LOCK ALLPAGES
Post by 1dg618
A clustered index is a good idea for example when a date is involved?
First a qualifier. For the first 13 or so years of Sybase ASE, we had
only tables which we now know to be APL lockscheme, and Clustered
Index meant one thing: the leaf level of the index is *clustered* with
the data row. That resulted in the data being maintained in index
order; one less physical fetch on every access; a very fast Index;
pages & extents that were kept trim; supported Range Queries (which is
your question); etc.

With the advent of ASE 12, they introduced DPL/DRL lockschemes. These
have no Clustered Index. They have a Placement Index which is really
a Nonclustered Index (separate physical DataStructure) with two
additional qualities: it *initially* places the rows in index order
(which cannot be maintained under normal insert/update/delete, row
expansion, etc), and lives on the same segment as the data heap, in
the hope that the PI will be "close" to the heap (which is moot
because, given concurrent inserts to all table and their indices, the
extents for all objects are interspersed). These tables introduce a
new form of fragmentation, and you need to REORG them weekly
(currently not an online operation). There is no clustering of index
+data. It does not support Range Queries. The advantage is that the
granularity of the lock is smaller (but depending on other aspects of
your design, you may end up with much more locks).

For some reason unknown to me, and not explicitly declared by Sybase,
they did not change the syntax to provide for the Placement Index,
thus to refer to them you have to use "clustered index". This
obviously creates a lot of confusion. It creates upset (just two eg)
when people find out that their carefully designed "clustered index"
does not support the range queries that they designed it for; or that
the rows are not in index order and ASE is performing table scans when
they explicitly did everything to invoke an index scan. In the
manuals, sometimes they are called Placement and other times
"clustered".

I have been a Sybase performance specialist for 20 years and a Sybase
Partner for 13 years, ending last year. I informed Sybase of this
misrepresentation and the confusion it causes, which is ongoing.
Nothing has come of it. In fact the evangelists heavily push the
notion that a Placement Index is "the same as" a Clustered Index. I
got barred from the company newsgroup (which only have employees
posting now, all the independent contributors are gone). Bret is a
venerable and respectable Sybase engineer, but he is an employee, you
will not get this clarity from them. For my part, as a consultant and
trainer, it does not work to use confused or dishonest terms, and
forever be explaining that sometimes an apple means an apple but other
times, it means two oranges. I cannot participate in the ongoing
misrepresentation, or imply implications which are know to be false,
therefore:
- I use the term Clustered Index for Clustered Index
- I use the term Placement Index for Placement Index
- Since there is no such thing as a "clustered index" for DPL/DRL
tables, I do not use that term for such tables
- I do not use confused terms, or propagate the confusion in the
manuals

Back to your question. As long as you are using a Clustered Index,
ie. your table is APL not DPL/DRL, then yes, it supports Range
Queries. If your table is DPL/DRL, then despite the term "clustered"
in the command, you do not have a Clustered Index, and therefore there
is nothing to support Range Queries.

In your question, if the Date is involved, or the Date + Author is
involved, it is a Range Query.

You are also implicitly dealing with another important aspect. Data
distribution. You have done a nice job, and come up with a typical
Relational or Natural Key. APL tables and their Clustered Indices are
designed for that. The data is the Naturally distributed across the
CI (which *is* the data); inserts are spread; there is no hotspot such
as there always is with an sequentially increasing single-column
"key"; select operations are nicely constrained to small ranges, and
do not interfere with each other; etc; etc; etc. But the tables/CIs
are for the old school who produce real Relational databases; DPL/DRL
are designed for, and more appropriate for, the record filing systems
placed in database containers.
Post by 1dg618
For example, if I have 29 million records going back to 1980 and I
want to find a book from 1997 by a particular author a clustered index
with the date first will quickly get to the 1997 records then quickly
to the author then it might have to scan for the book? I think, the
primary key clustered index will find the record pretty quick?
Yes, for the genuine Clustered Index (ie. not for any Nonclustered
Index, including Placement Index).

I would recommend a small change, which will give you even better data
distribution, and therefore even better performance (the query you
identify, as well as similar queries), selectivity, reduced scope of
interference, etc:

PRIMARY KEY CLUSTERED (Author, Date, Book)

You may notice that this supplies Bret's suggestion as well, without
an additional NCI.
Post by 1dg618
Now, if I want to find the isin and I know a date range the primary
key clustered index will quickly narrow the records because of the
date but may have to scan for the isin in that date range? Is there a
way to prevent a table scan for the isin?
Again, that is applicable only where you have a genuine CI. Notice
that you are naturally using the word "range".

Many people expect that the Optimiser works like that. Wish that it
would. The ASE 15 Optimiser has been totally re-written and it is a
much more intelligent beast than it used to be. Explaining its
operation cannot be done adequately in this media. Suffice to say
that all those sorts of considerations are made, as well as many, many
others.

To answer your question, given no other index (your CI, not my
recommended CI), if the no of rows identified *by other means* by the
Date was small enough, yes, it would avoid a table scan. To be clear,
if your date was just the year portion, no; if year+month, likely; year
+month+day, definitely. My recommended CI is useless for the purpose
of your example query.

The Optimiser considers various things, notably Statistics, and makes
a decision to *either* use an index to find the rows *or* to find the
starting position for a Range Query (CI only) *or* use a table scan.
It does not use combinations of that. NCIs, which good for finding a
few qualifying rows, cannot be used to *subsequently* range scan, no,
if it is going to do a range scan, the starting point has to be
identified via the CI. A "table scan" scans the entire CI leaf/data
level via the PageChain (APL) or the entire Heap via the
ObjectAllocMap:AllocUnit:AllocPage:Extent method (DPL/DRL), until the
criteria is satisfied (eg. existence check, TOP n, whatever).

"Covered Query" is of course a different thing. All index types can
be used for a "covered query", the intent being, the query can be
satisfied by the index columns alone, the data pages are not required.
Post by 1dg618
or a way to quickly find the
isin? Would a nonclustered index on isin help a query?
Absolutely, that is exactly what they are there for. And it would
eliminate the consideration of a range query.

Now please realise, you know what you are doing, ISIN is unique. You
will fetch the row with one single access (navigate B-Tree and find
one row satisfying the criteria).

The reason I state this qualifier is, many people just add NCIs to
"help" boost slow queries, and wonder why they are not used. Non-
unique NCIs are as useful as tits on a bull. It has nothing to do
with the Optimiser (which gets blamed), this applies for all products.
It has everything to do with understanding your data. The more
"unique" an index is, the more useful it is, and the more frequently
the Optimiser will choose it over a table scan. The less "unique" an
index is, the less useful it is, and the less frequently the Optimiser
will choose it over a table scan. Indices that are not "unique" are
simply a waste of space and processing power on every single insert/
delete.

- NCI on (Sex) -- useless to queries using Sex as one of their
sargs
- NCI on (Author) -- depends: useless for Celko; useful for Authors
who have published a few books
- NCI on (Date) -- depends, refer above
- NCI on (Book) -- depends: useless for books on "... for Dummies",
cookery, sex or technology; useful for books on history, the arts etc
- NCI on (Book, Sex) -- useless
- NCI on (Book, Author) -- Unique
- NCI on (Book, Date) -- useful (but probably never requested)
Post by 1dg618
In plans, I see queries doing forward scans and table scans. I assume,
a forward scan uses an index to drill down to the records then table
scans for the records?
Bret has provided the general answer re direction of scans.

If you are examining Plans, then I better give you a specific answer
re "drill-down". That is a data mining concept, and implies a form
of interaction with the *levels* of the data, not an index B-Tree
navigation concept, but I will apply it as I think that is what you
mean.

CI: Note the data pages are capable of being Range Scanned. It
navigates the B-Tree to find the first qualifying row (using the first
or first+second, etc, columns of the CI), then scans the data pages.
This is not a "table scan", this is a Range Scan (it is not identified
as such in the Plan; you will see "forward scan").

NCI or DPL/DRL: It navigates the B-Tree to find all (previously
identified as small in number) rows. Period. If the no of qualifying
rows was not small, the index is useless. There is no table or index
of heap scanning after that.

The "drill-down" concept (you are correctly intuiting what is
required, what the Optimiser will require) is important, you need to
confirm that the optimiser is or is not doing what you expect it to
do. Look for "

Unfortunately I do not have a error to demonstrate, I can only
demonstrate a correct example and provide an example error. I have a
CI with four columns, and *here* the Optimiser has chosen it
perfectly:
SCAN Operator (VA = 5)
FROM TABLE
tempdb..sg_HS_Statistic_CellValue
C_VAL
Using Clustered Index.
Index : UC_PK
Forward Scan.
Positioning by key.
Keys are:
iTable ASC
iColumn ASC
iPartition ASC
iCell ASC
If it chose only three columns, or two, then I should chase that down.

The worktables it creates are very important, especially if the source
table is large. For obvious reasons, the worktables are always APL.
It creates CIs for the worktables, which are more correct when your
tables are correct. Look for "Key Count:". If the worktable created
for the above Plan fragment (example only, this query did not create a
worktable) identified "Key Count: 1", I would know that it did not
infer a four-column CI, or find my four-columns useful, and I would
chase that down.
Post by 1dg618
I know I'm asking a lot of questions, but this is pretty interesting
stuff. There's a ton of informaton out there, but it's not really
straightforward.
That is right. The web is full of shallow information and
misinformation. The vendors websites naturally suit their commercial
interests, they cannot be relied upon to provide unbiased or
independently helpful information. The manuals are pathetic, written
by schizophrenics (people obsessed with concentrating on seeming
contradictions, and therefore cannot explain same; and thus altogether
avoiding the explanation of the important material).

You may find "Part A Data Structure" of this document useful, it
eliminates confusion, and provides far more information than the
manuals, and clearly:
____http://www.softwaregems.com.au/Documents/Sybase%20GEM%20Documents/
Sybase%20Fragmentation.pdf
Feel free to go the the /Documents directory and nose around.

Regards
Derek
Loading...