Discussion:
IYHO: fastest way to copy a multi-million rows table?
Tester
2009-11-10 20:11:16 UTC
Permalink
Bypassing the disk.

Something like a batched select into dest_db..dest_table from
src_db..src_table, or an insert ... select.
Anything that does not involve excessive logging or storing bcp output on
disk.
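
For reference, the two statements alluded to look roughly like this
(database and table names are placeholders; in ASE, the minimally logged
form also needs the 'select into/bulkcopy' database option enabled, the
exact option name varying slightly by version):

```sql
-- Minimally logged: creates dest_table on the fly.
select *
into dest_db..dest_table
from src_db..src_table

-- Fully logged alternative (dest_table must already exist):
insert into dest_db..dest_table
select * from src_db..src_table
```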
Leonid Gvirtz
2009-11-11 10:46:19 UTC
Permalink
Post by Tester
Bypassing the disk.
Something like a batch select into dest_db..dest_table from
src_db..src_table or insert select.
Anything that does not involve excessive logging and storing bcp output on
disk.
Hi

Assuming that you need to move the data between different dataservers,
you can use CIS for this purpose. Create a proxy table in the
destination database that points to your source table in the source
dataserver. Then you can run a SELECT INTO statement into a new table
in the destination dataserver from the proxy table with minimal
logging. You may also want to tune some CIS-related parameters on the
source and destination dataservers for better performance, such as
"cis packet size", "cis bulk insert array size", "cis cursor rows",
etc.

Hope it helps
Leonid Gvirtz
http://www.gvirtz-consulting.com
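
A minimal sketch of the CIS route described above; server, login, and
table names are placeholders, the sp_addserver network entry depends on
your interfaces file, and the sp_configure values are examples only, not
recommendations:

```sql
-- On the destination dataserver: register the source server and
-- create a proxy table pointing at the remote source table.
exec sp_addserver SRC_SRV, ASEnterprise, 'SRC_SRV'
go
create proxy_table src_proxy
at 'SRC_SRV.src_db.dbo.src_table'
go
-- Minimally logged copy through CIS:
select * into dest_db..dest_table from src_proxy
go
-- Tuning knobs mentioned above (example values):
exec sp_configure 'cis packet size', 16384
exec sp_configure 'cis bulk insert array size', 100
exec sp_configure 'cis cursor rows', 100
go
```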
Axel
2009-11-18 00:45:54 UTC
Permalink
SELECT INTO does look like the fastest method.
But in our case, for instance, we have a table with approx. 150M rows. While
the SELECT INTO took roughly 3 hours, it's the subsequent build of the
clustered index that times out.

Is there a way to speed that up, maybe run something in parallel? Partitioning
this table is not an option at this time, since the vendor prohibits us from
changing the model.
Post by Leonid Gvirtz
Hi

Assuming that you need to move the data between different dataservers,
you can use CIS for this purpose. Create a proxy table in the
destination database that points to your source table in the source
dataserver. Then, you will be able to run SELECT INTO statement into a
new table in the destination dataserver from the proxy table with
minimal logging. You may want to tune some CIS-related parameters on
source and destination dataservers for better performance, such as
"cis packet size", "cis bulk insert array size", "cis cursor rows"
etc.

Hope it helps
Leonid Gvirtz
http://www.gvirtz-consulting.com
Leonid Gvirtz
2009-11-18 20:00:25 UTC
Permalink
Hi Axel

Did you try the WITH SORTED_DATA option when creating your clustered
index? If not, it may be the way to go. Add an ORDER BY clause on the
clustered index columns to your SELECT INTO statement. Then you can
skip the data copying during the clustered index creation by using
the WITH SORTED_DATA option, since the data in the resulting table is
already sorted. See this link for more details:
http://infocenter.sybase.com/help/topic/com.sybase.infocenter.dc36272.1502/html/commands/X58868.htm#X24774

Hope it helps
Leonid Gvirtz
http://www.gvirtz-consulting.com
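
Put together, the suggested sequence looks like this (key_col1 and
key_col2 stand in for the clustered index columns):

```sql
-- Copy the rows already sorted on the clustered key...
select *
into dest_db..dest_table
from src_db..src_table
order by key_col1, key_col2
go
-- ...then build the clustered index without a re-sort:
create clustered index ix_dest
on dest_db..dest_table (key_col1, key_col2)
with sorted_data
go
```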
Post by Axel
SELECT INTO does look like the fastest method.
But in our case, for instance, we have a table with approx. 150M rows. While
the SELECT INTO took roughly 3 hours, it's the subsequent build of the
clustered index that times out.
Is there a way to speed that up, maybe run something in parallel? Partitioning
this table is not an option at this time, since the vendor prohibits us from
changing the model.
Axel
2009-11-18 22:02:50 UTC
Permalink
That's exactly what I was thinking; however, this seems to completely fill
the log segment in tempdb unless we make it as large as this enormous table
of ours. This happens during the SELECT INTO FROM src_table ORDER BY step.


"Leonid Gvirtz" <***@yahoo.com> wrote in message news:670903cc-2e4d-47ae-85ff-***@d5g2000yqm.googlegroups.com...
Hi Axel

Did you try the WITH SORTED_DATA option when creating your clustered
index? If not, it may be the way to go. Add an ORDER BY clause on the
clustered index columns to your SELECT INTO statement. Then you can
skip the data copying during the clustered index creation by using
the WITH SORTED_DATA option, since the data in the resulting table is
already sorted. See this link for more details:
http://infocenter.sybase.com/help/topic/com.sybase.infocenter.dc36272.1502/html/commands/X58868.htm#X24774

Hope it helps
Leonid Gvirtz
http://www.gvirtz-consulting.com
Mark A. Parsons
2009-11-18 22:20:26 UTC
Permalink
If the source table has an index identical to the desired clustered index
for the new table, consider having the SELECT/INTO use said index (force it
if necessary) on the source table.

Then use the 'with sorted_data' option when creating the clustered index on
the new table.

Upside: should eliminate the overhead of sorting the table (either via
select/into/order-by or create/index).

Downside: the SELECT could take longer to run and require more
physical/logical I/Os, especially if it has to make 150 million hops from
index leaf pages to data pages.

Obviously (?):

- if the source table is APL with a clustered index, then ensuring the
SELECT uses said clustered index should be (relatively) quick *and* allow
for the use of the 'with sorted_data' option

- use of parallel/worker threads for the SELECT could cause the data to be
scrambled during the INSERT step, thus negating the ability to use the
'with sorted_data' option
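
A sketch of this variant; ix_src is an assumed name for the source index
that matches the desired clustered key:

```sql
-- Read the source through the matching index so rows arrive in
-- clustered-key order (no ORDER BY, so no sort in tempdb):
select *
into dest_db..dest_table
from src_db..src_table (index ix_src)
go
-- Rows were inserted in key order, so sorted_data applies:
create clustered index ix_dest
on dest_db..dest_table (key_col)
with sorted_data
go
```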
Post by Axel
That's exactly what I was thinking; however, this seems to completely
fill the log segment in tempdb unless we make it as large as this
enormous table of ours. This happens during the SELECT INTO FROM
src_table ORDER BY step.
Mark A. Parsons
2009-11-18 22:36:43 UTC
Permalink
Instead of:

if the source table is APL with a clustered index

Should have been:

if the source table is APL with the same desired clustered index
Post by Mark A. Parsons
If the source table has an index identical to the desired clustered
index for the new table, consider having the SELECT/INTO use said index
(force it if necessary) from the source table.
Then use the 'with sorted_data' option when creating the clustered index on the new table.
Upside: should eliminate the overhead of sorting the table (either via
select/into/order-by or create/index)
Downside: SELECT could take longer to run and require more
physical/logical IOs, especially if having to make 150 million hops from
index leaf pages to data pages
- if the source table is APL with a clustered index, then ensuring the
SELECT uses said clustered index should be (relatively) quick *and*
allow for the use of the 'with sorted_data' option
- use of parallel/worker threads for the SELECT could cause the data to
be scrambled during the INSERT step, thus negating the ability to use
the 'with sorted_data' option
Leonid Gvirtz
2009-11-19 11:51:31 UTC
Permalink
Hi

You can use the CIS trace flags (11201-11207, server-wide) to see
the actual query text that is sent to the source dataserver. Then
you can check the execution plan on the source dataserver and watch
for Worktable creation. I suspect that the Worktables used by the
query fill up your tempdb on the source dataserver. If you have the
same clustered index on the source table, then an ORDER BY on the
index columns should not cause Worktable creation.

Hope it helps
Leonid Gvirtz
http://www.gvirtz-consulting.com
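
For example (flag numbers as given above; what each individual flag
prints should be checked against the CIS documentation for your ASE
version):

```sql
-- Turn the CIS trace flags on server-wide, run the query, then
-- turn them off again; output appears in the ASE errorlog.
dbcc traceon(11201, 11202, 11203, 11204, 11205, 11206, 11207)
go
select * into dest_db..dest_table from src_proxy
go
dbcc traceoff(11201, 11202, 11203, 11204, 11205, 11206, 11207)
go
```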


On Nov 19, 12:36 am, "Mark A. Parsons" wrote:
     Instead of: if the source table is APL with a clustered index
     Should have been: if the source table is APL with the same desired clustered index
Keith
2009-11-12 16:53:36 UTC
Permalink
Post by Tester
Bypassing the disk.
Something like a batch select into dest_db..dest_table from
src_db..src_table or insert select.
Anything that does not involve excessive logging and storing bcp output on
disk.
Or, if you can't bypass the disk completely and don't have CIS
available, I've had good results on some OSes by doing this:

1) Create a FIFO (named pipe).
2) Start up a bcp in from the FIFO.
3) Run a bcp out to the FIFO.

Keith
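
A sketch of that FIFO pipeline. The bcp commands are commented out
because they need real servers (SRC_SRV, DEST_SRV, and the login are
placeholders); plain cat/printf stand in for them here just to show the
data streaming through the pipe without a full intermediate file:

```shell
set -e
pipe=/tmp/bcp_fifo_demo
rm -f "$pipe" /tmp/bcp_fifo_copy.txt
mkfifo "$pipe"                            # 1) create the FIFO

# Real pipeline (placeholders; start the reader first):
# bcp dest_db..dest_table in "$pipe" -c -S DEST_SRV -U login -P pw -b 10000 &
# bcp src_db..src_table out "$pipe" -c -S SRC_SRV -U login -P pw

# Stand-ins demonstrating the flow:
cat "$pipe" > /tmp/bcp_fifo_copy.txt &    # 2) plays the 'bcp in' side
printf 'row1\nrow2\nrow3\n' > "$pipe"     # 3) plays the 'bcp out' side
wait
rm "$pipe"
```

Because the FIFO is a rendezvous, not a file, the full table never lands
on disk between the two bcp processes.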
joop
2009-11-13 20:05:09 UTC
Permalink
Post by Keith
Post by Tester
Bypassing the disk.
Something like a batch select into dest_db..dest_table from
src_db..src_table or insert select.
Anything that does not involve excessive logging and storing bcp output on
disk.
Or if you can't bypass the disk completely and don't have the CIS
1) Create a FIFO (named pipe).
2) Start up a bcp in from the FIFO.
3) Run a bcp out to the FIFO.
Keith
And if you go across machines:

pipe the output from the previous post into ssh,
then into a FIFO again on the remote side,
and run the bcp in from that pipe.
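
The cross-machine variant, sketched the same way; host names, paths, and
logins are placeholders, and a local `sh -c` stands in for the ssh hop so
the flow can be demonstrated end to end:

```shell
set -e
outpipe=/tmp/xfer_out_fifo
rm -f "$outpipe" /tmp/xfer_received.txt
mkfifo "$outpipe"

# Real pipeline (placeholders):
#   on dest host:  mkfifo /tmp/in_fifo
#                  bcp dest_db..dest_table in /tmp/in_fifo -c -S DEST_SRV -U login -P pw &
#   on src host:   bcp src_db..src_table out "$outpipe" -c -S SRC_SRV -U login -P pw &
#                  ssh user@desthost 'cat > /tmp/in_fifo' < "$outpipe"

# Stand-in: sh -c plays the ssh hop.
sh -c 'cat > /tmp/xfer_received.txt' < "$outpipe" &
printf 'r1\nr2\n' > "$outpipe"
wait
rm "$outpipe"
```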