Key Data Science

Dec
22

Data driven holidays

I’ve recently started planning my next holidays in Fuerteventura. If you’re into watersports like me, you must be familiar with the pattern of rising and falling sea level with respect to land. We are talking about the tide here. In the south of Fuerteventura, there is a fantastic spot called Sotavento. It’s a 4 km long sandy beach famous for its kitesurfing possibilities and renowned for some of the bluest shallow flat water you’ve ever seen.

But there’s a catch: the lagoon only gets filled at high tide, and not every high tide brings enough water to fill it entirely. So how do you plan your perfect holidays to ensure that the lagoon has enough water for you to have fun all week?

Well, you can start with the lunar calendar data. The gravitational pull of the moon causes the oceans to bulge. Tides change in height – low water and high water levels vary throughout the month and year, building up to a maximum and falling to a minimum twice a month. When the sun, moon and earth all line up at new or full moon, we get the highest tides, which are called ‘spring’ tides (nothing to do with the time of year!).

Not all spring tides are the same size, though, so it’s probably better to switch to tide tables at this point. Springs nearest the equinoxes (around 20 March and 22 September, when day and night are of roughly equal length all over the world) are slightly bigger. This is because the earth’s orbit and tilt align with the sun and the moon in such a way that gravity has the most effect. In conjunction with the ocean currents and the local geography, the tides become much higher than normal.

Finally, there’s the calendar week. Ideally, you want the spring tide to peak mid-week so you can enjoy the lagoon right after your arrival on Saturday or Sunday. You also want high water to happen in the early afternoon – you generally have about two hours either side of high tide for kiting. This gets you some well-deserved sleep in the morning while still allowing for a midday session in the sun.

Now, equipped with all the know-how and data, picking your holiday dates seems like a walk in the park. Ohh, I almost forgot about the weather – you should also factor in the wind forecast, which varies considerably with the seasons – no wind, no fun 🙂

Uncategorised
Jun
11

Round numbers are always false*

I’ve been refreshing my knowledge of statistics recently – it’s been a while since I took my university course. Aside from more number-focused books like "Statistics in Plain English" by Timothy C. Urdan, I recently picked up "How to Lie with Statistics" by Darrell Huff.

It’s an old classic, first published in 1954, and I must say one of the most informative reads I’ve had in a long time. You may find the examples a bit dated, but the concepts are timeless and still valid.

Well, the world has changed so much, but numbers are still "manipulated" in the same ways.

Ohh, and it also has some delightful illustrations…

Do you want to spot when you are fed misleading numbers? Then this book is for you.

*Dr Samuel Johnson – British author, linguist and lexicographer

Statistics
Jan
03

Are we all doomed?

I finally got around to finishing Bruce Schneier’s latest book “Data and Goliath: The Hidden Battles to Collect Your Data and Control Your World”. Reading this book was a real eye-opener.

If you are interested in big data, privacy and what governments or big companies are doing with your data – you should read it. If you are not interested but own a mobile phone, TV or any other Internet device – you should read it.

It’s a well-researched book that documents the current situation. It’s a must-read for anyone interested in how big data relates to human power structures.

Bruce Schneier is well known for his contributions to the fields of cryptography and computer security. He wrote "Applied Cryptography", a definitive guide to cryptography for programmers, and he also created the Blowfish and Twofish encryption algorithms. But don’t be put off by the complexity of Schneier’s previous work – the latest book is not technical and focuses mainly on the social aspects.

Yes, you can give it to your mum – and trust me she should read it as well!

Big Data, Security
Dec
15

Pentaho Kettle issues with FTPS

Pentaho is a Swiss Army knife when it comes to data transfer. I often use it to collect data from various remote locations using different protocols like HTTP, FTP, FTPS or SFTP.

Every Friday I wear my security hat and work on at least one task to improve security at my workplace. This time I noticed that some potentially sensitive data was being downloaded over the Internet using FTP. I probed the server and discovered that it also supports FTPS. Switching to a more secure protocol seemed like a quick win.

On a side note, FTPS is often confused with SFTP and vice versa, even though these protocols share nothing in common except their ability to transfer files securely. FTPS is plain FTP wrapped in TLS, while SFTP is based on the SSH protocol, which is best known for providing secure access to shell accounts on remote Unix servers.
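
To illustrate the difference from a client’s perspective, here is a rough sketch – the host, user and path below are made up. SFTP rides on SSH, while FTPS needs an FTP client that can negotiate TLS, such as lftp:

# SFTP: runs over SSH, port 22 by default
sftp user@ftp.example.com:/data/report.csv

# FTPS: plain FTP upgraded to TLS (usually port 21), forced here via lftp settings
lftp -u user -e "set ftp:ssl-force true; set ftp:ssl-protect-data true; get /data/report.csv; bye" ftp.example.com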

I changed the job to use FTPS and ran Pentaho. The job finished very quickly and didn’t produce any errors. Success!?

Not quite. Upon further inspection, I noticed that absolutely no data had been processed. It seemed like the downloaded file was empty, or the file wasn’t transferred at all.

I manually connected over FTPS using lftp. The files were there and definitely not empty. Unfortunately, Pentaho doesn’t give you any option to increase the logging verbosity of its FTP client. Luckily, the FTP server was local, and I was able to put it into debugging mode with the following settings in vsftpd.conf:

xferlog_enable=YES
xferlog_std_format=NO
log_ftp_protocol=YES

Note: log_ftp_protocol — When enabled in conjunction with xferlog_enable and with xferlog_std_format set to NO, all FTP commands and responses are logged. This directive is useful for debugging.

The debug log showed an additional error, which points to the require_ssl_reuse option:
Sun Mar 1 11:09:02 2015 [pid 12118] [super] FTP command: Client "195.224.x.x", "LIST /data"
Sun Mar 1 11:09:02 2015 [pid 12118] [super] FTP response: Client "195.224.x.x", "150 Here comes the directory listing."
Sun Mar 1 11:09:02 2015 [pid 12117] [super] DEBUG: Client "195.224.x.x", "No SSL session reuse on data channel."
Sun Mar 1 11:09:02 2015 [pid 12118] [super] FTP response: Client "195.224.x.x", "522 SSL connection failed; session reuse required: see require_ssl_reuse option in vsftpd.conf man page"

Note: require_ssl_reuse
If set to yes, all SSL data connections are required to exhibit SSL session reuse (which proves that they know the same master secret as the control channel). Although this is a secure default, it may break many FTP clients, so you may want to disable it.
Default: YES

After a bit more time spent researching the issue, it turned out that the Apache Commons Net FTPS library used by Pentaho does not support SSL session reuse; in fact, there’s an open Jira ticket in the Apache Commons Net (NET) project to fix this issue.

Without this option, if an attacker connects and establishes the SSL data connection before the legitimate user, they get to either steal the download or supply the upload data. The likelihood of successful exploitation, especially in an internal environment, is low, so I decided to disable it for now:

require_ssl_reuse=NO

With the setting disabled, a quick log inspection confirmed a successful download using Pentaho:

Sun Mar 1 11:12:27 2015 [pid 12259] [super] FTP command: Client "195.224.x.x", "RETR /data/monthlyreport_201512.csv"
Sun Mar 1 11:12:27 2015 [pid 12259] [super] FTP response: Client "195.224.x.x", "150 Opening BINARY mode data connection for /data/monthlyreport_201512.csv (761332 bytes)."
Sun Mar 1 11:12:27 2015 [pid 12259] [super] OK DOWNLOAD: Client "195.224.x.x", "/data/monthlyreport_201512.csv", 761332 bytes, 840.62Kbyte/sec
Sun Mar 1 11:12:27 2015 [pid 12259] [super] FTP response: Client "195.224.x.x", "226 Transfer complete."

I hope the issue will be fixed in Apache Commons Net soon so that I can revert the setting to the more secure default.

Pentaho
Nov
15

Tableau and Email

From time to time I need to send a report to someone without a Tableau Online licence. If for any reason you have to do it daily, this manual task can quickly become a burden.

Luckily, there is a (limited) way to automate it with a bit of scripting. It gives you the option of emailing a PDF, PNG or CSV export of views and workbooks (not recommended from a security point of view, especially if your data is sensitive!). A Windows box is needed for this, which is not ideal.

The first step is to download and install the tabcmd tool.

The next step is to script everything in your favourite scripting language. I went with PowerShell, as this has to run on Windows. Here is a proof-of-concept script that does the absolute minimum:

# Clean the old report
If (Test-Path C:\pstmp\report_file.csv) {
    Remove-Item C:\pstmp\report_file.csv
}

# Set up a connection to Tableau Online
$command = @'
cmd.exe /C "C:\Program Files\Tableau\Tableau Server\9.0\extras\Command Line Utility\tabcmd.exe" login -s https://10ay.online.tableau.com/ -u USER -p PASSS
'@

Invoke-Expression -Command:$command

## Refresh and pull report
$command2 = @'
cmd.exe /C "C:\Program Files\Tableau\Tableau Server\9.0\extras\Command Line Utility\tabcmd.exe" get /views/path_toview/view_name.csv?:refresh=yes -f C:\pstmp\report_file.csv
'@

Invoke-Expression -Command:$command2

## Email credential
$pwd = ConvertTo-SecureString 'PASSS' -AsPlainText -Force
$cred = New-Object System.Management.Automation.PSCredential ('[email protected]', $pwd)

## Send the email
$param = @{
    SmtpServer  = 'smtp details'
    Port        = 587
    Credential  = $cred
    UseSSL      = $true
    From        = '[email protected]'
    To          = @('[email protected]', '[email protected]')
    Subject     = 'Blog daily report'
    Body        = "Blog daily"
    Attachments = 'C:\pstmp\report_file.csv'
}

Send-MailMessage @param

Please note that this is just a PoC. Be advised that storing a username and password in a script is a security issue. Also, it’s a good idea to create a generic email address just for this purpose rather than sending the report from your personal one.
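
If you really do need the report daily, you could register the script with Windows Task Scheduler – a minimal sketch, assuming the script above was saved as C:\pstmp\send_report.ps1 (the path and task name are made up):

schtasks /Create /SC DAILY /ST 07:00 /TN "Tableau daily report" /TR "powershell.exe -ExecutionPolicy Bypass -File C:\pstmp\send_report.ps1"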

Finally, you will find more documentation about tabcmd here.

Tableau
Oct
13

Resizing photos in bulk

I recently had to create a simple batch script to resize a massive number of images quickly. The server on which the job was going to run had 16 CPUs, so I wanted to ensure that all the processing power would be utilised.

I started off with Python but soon realised that there is a much easier way of doing it. The server had a minimal Linux install, but surprisingly ImageMagick was there. A quick look at the man pages and I came up with:

find . -name "*.[Jj][Pp][Gg]" -type f -print0 | xargs -0 -P16 -I'{}' convert -verbose -quality 100 -resize 10% {} {}

Some people install GNU Parallel for this, but there’s the -P option in xargs already. The bonus is that find and xargs are core components of virtually every Linux distro out there.
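
One thing to keep in mind is that the command above overwrites each file in place. If you’d rather keep the originals, a variant along these lines writes the resized copies to a separate directory instead (the directory name is arbitrary, and ImageMagick’s mogrify does the work):

mkdir -p resized
find . -maxdepth 1 -name "*.[Jj][Pp][Gg]" -type f -print0 | xargs -0 -P16 -n10 mogrify -path resized -quality 100 -resize 10%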

Linux
Sep
30

All the fun that comes with BLOBs

Sometimes we have to live with huge BLOBs in a database. It may be due to some proprietary system that you can’t change; on another occasion, it’s because your developers can’t live without persisting large code objects to the database. You name it…

A Binary Large Object (BLOB) is a data type designed to store binary data in a column. This is different from most other data types, such as integers and strings, which tend to be small and manageable. Since BLOBs hold arbitrary binary data, they can be used to store images, multimedia files, anything. When overused, BLOBs lead to a massive increase in database size and a lengthy, often complicated backup and restore process.
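
For reference, a BLOB column is declared like any other – a minimal sketch with made-up database, table and column names:

$ mysql mydb -e "CREATE TABLE documents (id INT AUTO_INCREMENT PRIMARY KEY, name VARCHAR(255), payload LONGBLOB);"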

The database dump is usually the easy part, e.g.

$mysqldump --all-databases --single-transaction --hex-blob --events --max-allowed-packet=500M > backupfile
* Note: --max-allowed-packet here is the client-side limit; it should match the server’s max_allowed_packet setting.
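
If you need to check or raise the server-side value, something along these lines should do (the value is in bytes, 1GB is the hard maximum, and SET GLOBAL only affects new connections):

$ mysql -e "SHOW VARIABLES LIKE 'max_allowed_packet';"
$ mysql -e "SET GLOBAL max_allowed_packet = 1073741824;"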

However, upon restoring you may find that the process fails with:

$ mysql --max-allowed-packet=500M < backupfile
ERROR 2006 (HY000) at line 34596: MySQL server has gone away

You will find absolutely nothing in the MySQL log files. The first idea that comes to mind when I see this error is to increase the timeouts. Well, that sometimes helps, but not necessarily when BLOBs are involved – you may spend long hours increasing the timeouts and retrying, with no success.

We know on which line (statement) the restore process fails. I have found a few times that the INSERT statement tries to write a BLOB that is bigger than max-allowed-packet – despite the fact that the backup was taken successfully with this exact setting.

You can run the following sed command to extract the statement from the backup file and check its size:

$sed -n '34596p' backupfile > f_blob
$du -h f_blob
660M f_blob

Set max-allowed-packet to a value slightly higher than the statement size, and the backup will restore just fine:

$ mysql --max-allowed-packet=700M < backupfile

MySQL
Aug
31

Hello world!

This is my first post.

Uncategorised