On Oct 20, 8:09 am, 1dg618 <***@gmail.com> wrote:
You are asking a very important question which relates to many aspects
of table/index design, and the resulting performance of queries, so
let me be clear.
SomeTable (
Post by 1dg618Date
Author
Book -- Title
Sex
ISIN
Post by 1dg618junk1
junk2
junk3
PRIMARY KEY CLUSTERED (Date, Author, Book)
)
LOCK ALLPAGES
Post by 1dg618A clustered index is a good idea for example when a date is involved?
First a qualifier. For the first 13 or so years of Sybase ASE, we had
only tables which we now know to be APL lockscheme, and Clustered
Index meant one thing: the leaf level of the index is *clustered* with
the data row. That resulted in the data being maintained in index
order; one less physical fetch on every access; a very fast Index;
pages & extents that were kept trim; supported Range Queries (which is
your question); etc.
With the advent of ASE 12, they introduced DPL/DRL lockschemes. These
have no Clustered Index. They have a Placement Index which is really
a Nonclustered Index (separate physical DataStructure) with two
additional qualities: it *initially* places the rows in index order
(which cannot be maintained under normal insert/update/delete, row
expansion, etc), and lives on the same segment as the data heap, in
the hope that the PI will be "close" to the heap (which is moot
because, given concurrent inserts to all table and their indices, the
extents for all objects are interspersed). These tables introduce a
new form of fragmentation, and you need to REORG them weekly
(currently not an online operation). There is no clustering of index
+data. It does not support Range Queries. The advantage is that the
granularity of the lock is smaller (but depending on other aspects of
your design, you may end up with much more locks).
For some reason unknown to me, and not explicitly declared by Sybase,
they did not change the syntax to provide for the Placement Index,
thus to refer to them you have to use "clustered index". This
obviously creates a lot of confusion. It creates upset (just two eg)
when people find out that their carefully designed "clustered index"
does not support the range queries that they designed it for; or that
the rows are not in index order and ASE is performing table scans when
they explicitly did everything to invoke an index scan. In the
manuals, sometimes they are called Placement and other times
"clustered".
I have been a Sybase performance specialist for 20 years and a Sybase
Partner for 13 years, ending last year. I informed Sybase of this
misrepresentation and the confusion it causes, which is ongoing.
Nothing has come of it. In fact the evangelists heavily push the
notion that a Placement Index is "the same as" a Clustered Index. I
got barred from the company newsgroup (which only have employees
posting now, all the independent contributors are gone). Bret is a
venerable and respectable Sybase engineer, but he is an employee, you
will not get this clarity from them. For my part, as a consultant and
trainer, it does not work to use confused or dishonest terms, and
forever be explaining that sometimes an apple means an apple but other
times, it means two oranges. I cannot participate in the ongoing
misrepresentation, or imply implications which are know to be false,
therefore:
- I use the term Clustered Index for Clustered Index
- I use the term Placement Index for Placement Index
- Since there is no such thing as a "clustered index" for DPL/DRL
tables, I do not use that term for such tables
- I do not use confused terms, or propagate the confusion in the
manuals
Back to your question. As long as you are using a Clustered Index,
ie. your table is APL not DPL/DRL, then yes, it supports Range
Queries. If your table is DPL/DRL, then despite the term "clustered"
in the command, you do not have a Clustered Index, and therefore there
is nothing to support Range Queries.
In your question, if the Date is involved, or the Date + Author is
involved, it is a Range Query.
You are also implicitly dealing with another important aspect. Data
distribution. You have done a nice job, and come up with a typical
Relational or Natural Key. APL tables and their Clustered Indices are
designed for that. The data is the Naturally distributed across the
CI (which *is* the data); inserts are spread; there is no hotspot such
as there always is with an sequentially increasing single-column
"key"; select operations are nicely constrained to small ranges, and
do not interfere with each other; etc; etc; etc. But the tables/CIs
are for the old school who produce real Relational databases; DPL/DRL
are designed for, and more appropriate for, the record filing systems
placed in database containers.
Post by 1dg618For example, if I have 29 million records going back to 1980 and I
want to find a book from 1997 by a particular author a clustered index
with the date first will quickly get to the 1997 records then quickly
to the author then it might have to scan for the book? I think, the
primary key clustered index will find the record pretty quick?
Yes, for the genuine Clustered Index (ie. not for any Nonclustered
Index, including Placement Index).
I would recommend a small change, which will give you even better data
distribution, and therefore even better performance (the query you
identify, as well as similar queries), selectivity, reduced scope of
interference, etc:
PRIMARY KEY CLUSTERED (Author, Date, Book)
You may notice that this supplies Bret's suggestion as well, without
an additional NCI.
Post by 1dg618Now, if I want to find the isin and I know a date range the primary
key clustered index will quickly narrow the records because of the
date but may have to scan for the isin in that date range? Is there a
way to prevent a table scan for the isin?
Again, that is applicable only where you have a genuine CI. Notice
that you are naturally using the word "range".
Many people expect that the Optimiser works like that. Wish that it
would. The ASE 15 Optimiser has been totally re-written and it is a
much more intelligent beast than it used to be. Explaining its
operation cannot be done adequately in this media. Suffice to say
that all those sorts of considerations are made, as well as many, many
others.
To answer your question, given no other index (your CI, not my
recommended CI), if the no of rows identified *by other means* by the
Date was small enough, yes, it would avoid a table scan. To be clear,
if your date was just the year portion, no; if year+month, likely; year
+month+day, definitely. My recommended CI is useless for the purpose
of your example query.
The Optimiser considers various things, notably Statistics, and makes
a decision to *either* use an index to find the rows *or* to find the
starting position for a Range Query (CI only) *or* use a table scan.
It does not use combinations of that. NCIs, which good for finding a
few qualifying rows, cannot be used to *subsequently* range scan, no,
if it is going to do a range scan, the starting point has to be
identified via the CI. A "table scan" scans the entire CI leaf/data
level via the PageChain (APL) or the entire Heap via the
ObjectAllocMap:AllocUnit:AllocPage:Extent method (DPL/DRL), until the
criteria is satisfied (eg. existence check, TOP n, whatever).
"Covered Query" is of course a different thing. All index types can
be used for a "covered query", the intent being, the query can be
satisfied by the index columns alone, the data pages are not required.
Post by 1dg618or a way to quickly find the
isin? Would a nonclustered index on isin help a query?
Absolutely, that is exactly what they are there for. And it would
eliminate the consideration of a range query.
Now please realise, you know what you are doing, ISIN is unique. You
will fetch the row with one single access (navigate B-Tree and find
one row satisfying the criteria).
The reason I state this qualifier is, many people just add NCIs to
"help" boost slow queries, and wonder why they are not used. Non-
unique NCIs are as useful as tits on a bull. It has nothing to do
with the Optimiser (which gets blamed), this applies for all products.
It has everything to do with understanding your data. The more
"unique" an index is, the more useful it is, and the more frequently
the Optimiser will choose it over a table scan. The less "unique" an
index is, the less useful it is, and the less frequently the Optimiser
will choose it over a table scan. Indices that are not "unique" are
simply a waste of space and processing power on every single insert/
delete.
- NCI on (Sex) -- useless to queries using Sex as one of their
sargs
- NCI on (Author) -- depends: useless for Celko; useful for Authors
who have published a few books
- NCI on (Date) -- depends, refer above
- NCI on (Book) -- depends: useless for books on "... for Dummies",
cookery, sex or technology; useful for books on history, the arts etc
- NCI on (Book, Sex) -- useless
- NCI on (Book, Author) -- Unique
- NCI on (Book, Date) -- useful (but probably never requested)
Post by 1dg618In plans, I see queries doing forward scans and table scans. I assume,
a forward scan uses an index to drill down to the records then table
scans for the records?
Bret has provided the general answer re direction of scans.
If you are examining Plans, then I better give you a specific answer
re "drill-down". That is a data mining concept, and implies a form
of interaction with the *levels* of the data, not an index B-Tree
navigation concept, but I will apply it as I think that is what you
mean.
CI: Note the data pages are capable of being Range Scanned. It
navigates the B-Tree to find the first qualifying row (using the first
or first+second, etc, columns of the CI), then scans the data pages.
This is not a "table scan", this is a Range Scan (it is not identified
as such in the Plan; you will see "forward scan").
NCI or DPL/DRL: It navigates the B-Tree to find all (previously
identified as small in number) rows. Period. If the no of qualifying
rows was not small, the index is useless. There is no table or index
of heap scanning after that.
The "drill-down" concept (you are correctly intuiting what is
required, what the Optimiser will require) is important, you need to
confirm that the optimiser is or is not doing what you expect it to
do. Look for "
Unfortunately I do not have a error to demonstrate, I can only
demonstrate a correct example and provide an example error. I have a
CI with four columns, and *here* the Optimiser has chosen it
perfectly:
SCAN Operator (VA = 5)
FROM TABLE
tempdb..sg_HS_Statistic_CellValue
C_VAL
Using Clustered Index.
Index : UC_PK
Forward Scan.
Positioning by key.
Keys are:
iTable ASC
iColumn ASC
iPartition ASC
iCell ASC
If it chose only three columns, or two, then I should chase that down.
The worktables it creates are very important, especially if the source
table is large. For obvious reasons, the worktables are always APL.
It creates CIs for the worktables, which are more correct when your
tables are correct. Look for "Key Count:". If the worktable created
for the above Plan fragment (example only, this query did not create a
worktable) identified "Key Count: 1", I would know that it did not
infer a four-column CI, or find my four-columns useful, and I would
chase that down.
Post by 1dg618I know I'm asking a lot of questions, but this is pretty interesting
stuff. There's a ton of informaton out there, but it's not really
straightforward.
That is right. The web is full of shallow information and
misinformation. The vendors websites naturally suit their commercial
interests, they cannot be relied upon to provide unbiased or
independently helpful information. The manuals are pathetic, written
by schizophrenics (people obsessed with concentrating on seeming
contradictions, and therefore cannot explain same; and thus altogether
avoiding the explanation of the important material).
You may find "Part A Data Structure" of this document useful, it
eliminates confusion, and provides far more information than the
manuals, and clearly:
____http://www.softwaregems.com.au/Documents/Sybase%20GEM%20Documents/
Sybase%20Fragmentation.pdf
Feel free to go the the /Documents directory and nose around.
Regards
Derek