Just another WordPress.com site

Archive for the ‘Select Query’ Category

mysql: Create external table using CSV engine and file

One thing that the CSV engine allows for is the instantaneous data transfer from flat files to information accessible via SQL in MySQL. For example, imagine having the need to load a file containing 5 million records into a MySQL table that uses the MyISAM storage engine:

mysql> desc client_detail;
| Field                 | Type          | Null | Key | Default | Extra |
| client_transaction_id | decimal(22,0) | YES  |     | NULL    |       |
| transaction_timestamp | datetime      | YES  |     | NULL    |       |
| transaction_comment   | varchar(30)   | YES  |     | NULL    |       |
3 rows in set (0.02 sec)

mysql> load data infile '/usr/local/mysql519/data/gim/flatdata.dat' 
    ->  into table client_detail
    ->  fields terminated by ',';
Query OK, 5023575 rows affected, 0 warnings (27.38 sec)
Records: 5023575  Deleted: 0  Skipped: 0  Warnings: 0

While MyISAM accepts the data pretty quickly, you can make all the data in the flat file instantly available to MySQL in the following manner. First, create a CSV table that mirrors the format of the incoming flat file:

mysql> create table client_detail_csv (client_transaction_id decimal(22,0),
    -> transaction_timestamp datetime, transaction_comment varchar(30))
    -> engine=csv;
Query OK, 0 rows affected (0.01 sec)

Then, go to the operating system and simply rename the operating system flat file to the file MySQL created for the new CSV table:

[mysql@linux1 ~]$ cd /usr/local/mysql519/data/gim

[mysql@linux1 gim]$ ls -l
total 785848
-rw-rw----  1 mysql mysql        35 May  1 06:54 client_detail_csv.CSM
-rw-rw----  1 mysql mysql         0 May  1 06:54 client_detail_csv.CSV
-rw-rw----  1 mysql mysql      8718 May  1 06:54 client_detail_csv.frm
-rw-rw----  1 mysql mysql      8718 May  1 05:56 client_detail.frm
-rw-rw----  1 mysql mysql 221037300 May  1 06:00 client_detail.MYD
-rw-rw----  1 mysql mysql      1024 May  1 06:00 client_detail.MYI
-rw-rw----  1 mysql mysql 291367350 May  1 05:55 flatdata.dat

[mysql@linux1 gim]$ mv flatdata.dat client_detail_csv.CSV

The rename operation occurs instantly as nothing data-wise changes at the operating system level. At that point, MySQL has all the data available to it:

mysql> flush tables;
Query OK, 0 rows affected (0.00 sec)
mysql> select count(*) from client_detail_csv;
| count(*) |
|  5023575 |
1 row in set (16.17 sec)

Thus, you’ve effectively loaded 5 million records instantaneously into MySQL. The great thing is the same effect would be experienced if the file contained 10 million, 20 million, or 100 million records.

Oracle : number check in database

if ( (LENGTH(TRIM(TRANSLATE(contentcode, ‘+0123456789′,’ ‘))) is not null) or (LENGTH(TRIM(TRANSLATE(otherfield, ‘+0123456789′,’ ‘))) is not null)) then
dbms_output.put_line(‘contentcode or otherfield is invalid’);

oracle : external table

1. in browser type url ip-Of-Server:1158/em/console/

  1. in browser type url ip-Of-Server:1158/em/console/
  2. login there
  3. go to Schema-> Directory Objects
  4. click the “create” Button
  5. fill in the form
  6. * Name DIR_WLIST
    * Path

    where Name= name of virtual directory and Path= path to the directory

  7. now thru ftp client (or putty ) create a file test.txt in /tmp/test_v_dir/
  8. then go to sqlplus
  9. type in follwing query
  10. create table XTERN_test( student_name varchar2(100) )
    organization external (
    type oracle_loader
    default directory DIR_WLIST
    access parameters (
    records delimited by newline
    location (‘test.txt’)
    reject limit unlimited;

How to find duplicate rows with SQL

How to find duplicate rows with SQL

This article shows how to find duplicated rows in a database table. This is a very common beginner question. The basic technique is straightforward. I’ll also show some variations, such as how to find “duplicates in two columns” (a recent question on the #mysql IRC channel).

How to find duplicated rows

The first step is to define what exactly makes a row a duplicate of another row. Most of the time this is easy: they have the same value in some column. I’ll take this as a working definition for this article, but you may need to alter the queries below if your notion of “duplicate” is more complicated.

For this article, I’ll use this sample data:

create table test(id int not null primary key, day date not null);

insert into test(id, day) values(1, '2006-10-08');
insert into test(id, day) values(2, '2006-10-08');
insert into test(id, day) values(3, '2006-10-09');

select * from test;
| id | day        |
|  1 | 2006-10-08 |
|  2 | 2006-10-08 |
|  3 | 2006-10-09 |

The first two rows have the same value in the day column, so if I consider those to be duplicates, here’s a query to find them. The query uses a GROUP BY clause to put all the rows with the same day value into one “group” and then count the size of the group:

 select day, count(*) from test GROUP BY day;
| day        | count(*) |
| 2006-10-08 |        2 |
| 2006-10-09 |        1 |

The duplicated rows have a count greater than one. If you only want to see rows that are duplicated, you need to use a HAVING clause (not a WHERE clause), like this:

select day, count(*) from test group by day HAVING count(*) > 1;
| day        | count(*) |
| 2006-10-08 |        2 |

This is the basic technique: group by the column that contains duplicates, and show only those groups having more than one row.

Why can’t you use a WHERE clause?

A WHERE clause filters the rows before they are grouped together. A HAVING clause filters them after grouping. That’s why you can’t use a WHERE clause in the above query.

How to delete duplicate rows

A related question is how to delete the ‘duplicate’ rows once you find them. A common task when cleaning up bad data is to delete all but one of the duplicates, so you can put proper indexes and primary keys on the table, and prevent duplicates from getting into the table again.

Again, the first thing to do is make sure your definition is clear. Exactly which row do you want to keep? The ‘first’ one? The one with the largest value of some column? For this article, I’ll assume you want to keep the ‘first’ row — the one with the smallest value of the id column. That means you want to delete every other row.

Probably the easiest way to do this is with a temporary table. Especially in MySQL, there are some restrictions about selecting from a table and updating it in the same query. You can get around these, as I explain in my article How to select from an update target in MySQL, but I’ll just avoid these complications and use a temporary table.

The exact definition of the task is to delete every row that has a duplicate, except the row with the minimal value of id for that group. So you need to find not only the rows where there’s more than one in the group, you also need to find the row you want to keep. You can do that with the MIN() function. Here are some queries to create the temporary table and find the data you need to do the DELETE:

create temporary table to_delete (day date not null, min_id int not null);

insert into to_delete(day, min_id)
   select day, MIN(id) from test group by day having count(*) > 1;

select * from to_delete;
| day        | min_id |
| 2006-10-08 |      1 |

Now that you have this data, you can proceed to delete the ‘bad’ rows. There are many ways to do this, and some are better than others (see my article about many-to-one problems in SQL), but again I’ll avoid the finer points and just show you a standard syntax that ought to work in any RDBMS that supports subqueries:

delete from test
   where exists(
      select * from to_delete
      where to_delete.day = test.day and to_delete.min_id <> test.id

If your RDBMS does not support subqueries, or if it’s more efficient, you may wish to do a multi-table delete. The syntax for this varies between systems, so you need to consult your system’s documentation. You may also need to do all of this in a transaction to avoid other users changing the data while you’re working, if that’s a concern.

How to find duplicates in multiple columns

Someone recently asked a question similar to this on the #mysql IRC channel:

I have a table with columns b and c that links two other tables b and c, and I want to find all rows that have duplicates in either b or c.

It was difficult to understand exactly what this meant, but after some conversation I grasped it: the person wanted to be able to put unique indexes on columns b and c separately.

It’s pretty easy to find rows with duplicate values in one or the other column, as I showed you above: just group by that column and count the group size. And it’s easy to find entire rows that are exact duplicates of other rows: just group by as many columns as you need. But it’s harder to identify rows that have either a duplicated b value or a duplicated c value. Take the following sample table, which is roughly what the person described:

create table a_b_c(
   a int not null primary key auto_increment,
   b int,
   c int

insert into a_b_c(b,c) values (1, 1);
insert into a_b_c(b,c) values (1, 2);
insert into a_b_c(b,c) values (1, 3);
insert into a_b_c(b,c) values (2, 1);
insert into a_b_c(b,c) values (2, 2);
insert into a_b_c(b,c) values (2, 3);
insert into a_b_c(b,c) values (3, 1);
insert into a_b_c(b,c) values (3, 2);
insert into a_b_c(b,c) values (3, 3);

Now, you can easily see there are some ‘duplicate’ rows in this table, but no two rows actually have the same tuple {b, c}. That’s why this is a bit more difficult to solve.

Queries that don’t work

If you group by two columns together, you’ll get various results depending on how you group and count. This is where the IRC user was getting stumped. Sometimes queries would find some duplicates but not others. Here are some of the things this person tried:

select b, c, count(*) from a_b_c
group by b, c
having count(distinct b > 1)
   or count(distinct c > 1);

This query returns every row in the table, with a COUNT(*) of 1, which seems to be wrong behavior, but it’s actually not. Why? Because the > 1 is inside the COUNT(). It’s pretty easy to miss, but this query is actually the same as

select b, c, count(*) from a_b_c
group by b, c
having count(1)
   or count(1);

Why? Because (b > 1) is a boolean expression. That’s not what you want at all. You want

select b, c, count(*) from a_b_c
group by b, c
having count(distinct b) > 1
   or count(distinct c) > 1;

This returns zero rows, of course, because there are no duplicate {b, c} tuples. The person tried many other combinations of HAVING clauses and ORs and ANDs, grouping by one column and counting the other, and so forth:

select b, count(*) from a_b_c group by b having count(distinct c) > 1;
| b    | count(*) |
|    1 |        3 |
|    2 |        3 |
|    3 |        3 |

Nothing found all the duplicates, though. What I think made it most frustrating is that it partially worked, making the person think it was almost the right query… perhaps just another variation would get it…

In fact, it’s impossible to do with this type of simple GROUP BY query. Why is this? It’s because when you group by one column, you distribute like values of the other column across multiple groups. You can see this visually by ordering by those columns, which is what grouping does. First, order by column b and see how they are grouped:

When you order (group) by column b, the duplicate values in column c are distributed into different groups, so you can’t count them with COUNT(DISTINCT c) as the person was trying to do. Aggregate functions such as COUNT() only operate within a group, and have no access to rows that are placed in other groups. Similarly, when you order by c, the duplicate values in column b are distributed into different groups. It is not possible to make this query do what’s desired.

Some correct solutions

Probably the simplest solution is to find the duplicates for each column separately and UNION them together, like this:

select b as value, count(*) as cnt, 'b' as what_col
 from a_b_c group by b having count(*) > 1
 select c as value, count(*) as cnt, 'c' as what_col
 from a_b_c group by c having count(*) > 1;
| value | cnt | what_col |
|     1 |   3 | b        |
|     2 |   3 | b        |
|     3 |   3 | b        |
|     1 |   3 | c        |
|     2 |   3 | c        |
|     3 |   3 | c        |

The what_col column in the output indicates what column the duplicate value was found in. Another approach is to use subqueries:

select a, b, c from a_b_c
 where b in (select b from a_b_c group by b having count(*) > 1)
    or c in (select c from a_b_c group by c having count(*) > 1);
| a  | b    | c    |
|  7 |    1 |    1 |
|  8 |    1 |    2 |
|  9 |    1 |    3 |
| 10 |    2 |    1 |
| 11 |    2 |    2 |
| 12 |    2 |    3 |
| 13 |    3 |    1 |
| 14 |    3 |    2 |
| 15 |    3 |    3 |

This is probably much less efficient than the UNION approach, and will show every duplicated row, not just the values that are duplicated. Still another approach is to do self-joins against grouped subqueries in the FROM clause. This is more complicated to write correctly, but might be necessary for some complex data, or for efficiency:

select a, a_b_c.b, a_b_c.c
from a_b_c
   left outer join (
      select b from a_b_c group by b having count(*) > 1
   ) as b on a_b_c.b = b.b
   left outer join (
      select c from a_b_c group by c having count(*) > 1
   ) as c on a_b_c.c = c.c
where b.b is not null or c.c is not null

Any of these queries will do, and I’m sure there are other ways too. If you can use UNION, it’s probably the easiest.

Some small and interesting queries

Finding Duplicates with SQL

Here’s a handy query for finding duplicates in a table. Suppose you want to find all email addresses in a table that exist more than once:

SELECT email,
 COUNT(email) AS NumOccurrences
FROM users
GROUP BY email
HAVING ( COUNT(email) > 1 )

You could also use this technique to find rows that occur exactly once:

SELECT email
FROM users
GROUP BY email
HAVING ( COUNT(email) = 1 )

Modify Columns Size

Here’s a handy query for finding duplicates in a table. Suppose you want to find all email addresses in a table that exist more than once:

Alter table table name Modify (column-name datatype)
Alter table user Modify (name varchar2(200));
%d bloggers like this: