Week 4: Creating Datasets
Hey everyone! Thank you so much for coming back to my blog. This week we will be discussing random datasets and why we should learn to work with generated data before we apply any real datasets.
It’s important that we don’t start off with real data, because real data makes it much more complicated to get started, and the same ideas can be practiced on simpler, generated data first. One of the easiest ways to do this is to create datasets out of either dots or numbers. We will be discussing dot datasets this week.
Generating dot datasets requires some coding. The simplest one to generate is the truly random dataset: each dot is the pair “(random(), random())”, which gives a random assortment of dots with no uniformity or regular pattern, so some areas end up more dense with dots than others. This is good if you want a non-representative dataset to play with. Other methods produce a more even pattern but are more complex to create.
This is because those methods rely on something called sampling. Sampling decides at which points we measure something, for example where we shoot rays or where we read the frequency and intensity in a program. Things as simple as resizing an image require sampling. Samples should be spread evenly to avoid gaps, but they should not follow a repeating, regular pattern, because that leads to aliasing. A simple approximation is Mitchell’s best-candidate algorithm: it produces a fairly even random distribution, though it still has some issues with both oversampling and undersampling. For each new sample it creates a number of candidates, and it picks as the best candidate the one whose nearest existing sample is farthest away. This creates a good level of uniformity because each new dot is placed well away from the dots around it, although no exact minimum distance is guaranteed.
function sample() {
  var bestCandidate, bestDistance = 0;
  // Try several random candidates for the next dot.
  for (var i = 0; i < numCandidates; ++i) {
    var c = [Math.random() * width, Math.random() * height],
        d = distance(findClosest(samples, c), c);
    // Keep the candidate whose nearest existing sample is farthest away.
    if (d > bestDistance) {
      bestDistance = d;
      bestCandidate = c;
    }
  }
  return bestCandidate;
}
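The “numCandidates” setting determines how many candidates are created for each new sample; the larger the number, the more even the sampling, at the cost of more work per dot. The “findClosest” function returns the existing sample closest to the current candidate. The code also relies on a “distance” function and a “samples” list that are not shown here.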
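To actually run the best-candidate function, the missing pieces have to come from somewhere. Here is a minimal sketch of how they might look. The canvas size, the number of candidates, the 200-dot target, and the brute-force versions of “distance” and “findClosest” are all my own assumptions for illustration, not part of the original code.

```javascript
// Illustrative values; the post does not specify them.
const width = 960, height = 500, numCandidates = 10;
const samples = []; // dots accepted so far

// Euclidean distance between two [x, y] points.
// If there is no existing sample yet (a is undefined), treat it as infinitely far.
function distance(a, b) {
  if (!a) return Infinity;
  return Math.hypot(a[0] - b[0], a[1] - b[1]);
}

// Brute-force search for the existing sample nearest to candidate c.
function findClosest(points, c) {
  let closest, closestDistance = Infinity;
  for (const p of points) {
    const d = distance(p, c);
    if (d < closestDistance) {
      closestDistance = d;
      closest = p;
    }
  }
  return closest; // undefined when points is empty
}

// Best-candidate step: keep the candidate farthest from its nearest existing sample.
function sample() {
  let bestCandidate, bestDistance = 0;
  for (let i = 0; i < numCandidates; ++i) {
    const c = [Math.random() * width, Math.random() * height];
    const d = distance(findClosest(samples, c), c);
    if (d > bestDistance) {
      bestDistance = d;
      bestCandidate = c;
    }
  }
  return bestCandidate;
}

// Build a dataset of 200 dots, one best candidate at a time.
for (let i = 0; i < 200; ++i) samples.push(sample());
```

The brute-force search checks every existing dot for every candidate, which is fine for a few hundred dots but gets slow for large datasets.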
function sample() {
  return [Math.random() * width, Math.random() * height];
}
This code skips the candidate step entirely, so it has horrible oversampling and undersampling, leading to large gaps and overlapping dots. It is the same as the first random command mentioned above, just scaled to the width and height, and it makes no attempt to maintain distance between dots.
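One rough way to see the clumping in numbers is to look for the closest pair of dots. The sketch below is only an illustration: the helper name “minPairDistance”, the canvas size, and the count of 500 dots are my own choices.

```javascript
// Smallest distance between any two points (brute force, fine for small sets).
function minPairDistance(points) {
  let min = Infinity;
  for (let i = 0; i < points.length; ++i) {
    for (let j = i + 1; j < points.length; ++j) {
      const d = Math.hypot(points[i][0] - points[j][0], points[i][1] - points[j][1]);
      if (d < min) min = d;
    }
  }
  return min;
}

// Hypothetical canvas size and dot count, chosen only for illustration.
const canvasWidth = 960, canvasHeight = 500;
const randomDots = Array.from({ length: 500 }, () => [
  Math.random() * canvasWidth,
  Math.random() * canvasHeight,
]);

console.log("closest pair of dots:", minPairDistance(randomDots).toFixed(2));
```

With truly random dots the closest pair is usually very close together, which is the overlapping you can see in the picture. The exact number changes on every run.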
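Parameters for Bridson’s algorithm:
- The area where points should be generated, in n dimensions (n = 2 for our purposes)
- r: the minimum distance between any 2 points
- k: the number of attempts to create a new point around an existing one before giving up on it
Grid cell size: r/sqrt(n), so that each cell can hold at most one sample. A cell stores the index of its sample, and -1 represents no sample.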
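To make the grid size concrete, here is a small worked sketch. The values r = 10 and a 100 by 60 area, along with the names “cellSize”, “gridCols”, “gridRows”, and “cellIndex”, are assumptions I picked for the example. The reason r/sqrt(n) works in 2D is that the diagonal of each cell is then exactly r, so two dots in the same cell can never be the full distance r apart unless they sit on opposite corners.

```javascript
// Assumed example values; the post does not fix them.
const r = 10;                 // minimum distance between points
const n = 2;                  // number of dimensions
const areaWidth = 100, areaHeight = 60;

const cellSize = r / Math.sqrt(n);                 // about 7.07 for r = 10
const gridCols = Math.ceil(areaWidth / cellSize);  // 15 columns
const gridRows = Math.ceil(areaHeight / cellSize); // 9 rows

// One slot per cell, with -1 meaning "no sample here yet".
const grid = new Array(gridCols * gridRows).fill(-1);

// Map a point to the index of the grid cell that contains it.
function cellIndex(x, y) {
  const col = Math.floor(x / cellSize);
  const row = Math.floor(y / cellSize);
  return row * gridCols + col;
}
```

Storing the grid as one flat array keeps the lookup to a single multiplication and addition per point.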
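Bridson’s algorithm:
Step 1:
Select the domain and initialize the grid with -1 in each cell.
Step 2:
Generate a random initial point, set its grid cell to 0 (its sample index), and put it on the active list.
Step 3:
Pick a random point from the active list and generate up to k candidate points in the annulus between r and 2r around it.
Step 4:
Check whether each candidate is far enough (at least r) from every existing sample. If it is, add it as a new sample and to the active list. If none of the k candidates works, remove the point from the active list.
Repeat steps 3 and 4 until the active list is empty.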
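Putting the steps together, here is a rough 2D sketch of how they could look in code. This is my own reading of the steps, not a reference implementation: the function name “poissonDiscSample”, the default k = 30, the 200 by 100 example area, and the choice to accept the first candidate that fits are all assumptions.

```javascript
// Sketch of the steps above in 2D. width, height, r and k are supplied by the caller.
function poissonDiscSample(width, height, r, k = 30) {
  const cellSize = r / Math.SQRT2;
  const cols = Math.ceil(width / cellSize);
  const rows = Math.ceil(height / cellSize);
  const grid = new Array(cols * rows).fill(-1); // Step 1: -1 means empty cell
  const points = [];
  const active = []; // indices of samples that may still get neighbours

  function insert(p) {
    const col = Math.floor(p[0] / cellSize);
    const row = Math.floor(p[1] / cellSize);
    grid[row * cols + col] = points.length; // store the sample's index
    active.push(points.length);
    points.push(p);
  }

  // True if p is inside the area and at least r from every existing sample.
  function isFarEnough(p) {
    if (p[0] < 0 || p[0] >= width || p[1] < 0 || p[1] >= height) return false;
    const col = Math.floor(p[0] / cellSize);
    const row = Math.floor(p[1] / cellSize);
    // Only cells within two steps can hold a sample closer than r.
    for (let y = Math.max(row - 2, 0); y <= Math.min(row + 2, rows - 1); ++y) {
      for (let x = Math.max(col - 2, 0); x <= Math.min(col + 2, cols - 1); ++x) {
        const index = grid[y * cols + x];
        if (index === -1) continue;
        const q = points[index];
        if (Math.hypot(q[0] - p[0], q[1] - p[1]) < r) return false;
      }
    }
    return true;
  }

  // Step 2: random first point, stored as sample 0.
  insert([Math.random() * width, Math.random() * height]);

  while (active.length > 0) {
    const pick = Math.floor(Math.random() * active.length);
    const origin = points[active[pick]];
    let found = false;
    // Step 3: up to k candidates in the annulus between r and 2r.
    for (let attempt = 0; attempt < k; ++attempt) {
      const angle = Math.random() * 2 * Math.PI;
      const radius = r * (1 + Math.random()); // uniform in radius, not in area
      const candidate = [
        origin[0] + radius * Math.cos(angle),
        origin[1] + radius * Math.sin(angle),
      ];
      // Step 4: accept the first candidate that is far enough away.
      if (isFarEnough(candidate)) {
        insert(candidate);
        found = true;
        break;
      }
    }
    // No room left around this sample, so take it off the active list.
    if (!found) active.splice(pick, 1);
  }
  return points;
}

const dots = poissonDiscSample(200, 100, 10);
console.log("generated", dots.length, "dots");
```

The grid is what keeps this fast: each candidate only has to be compared against the handful of cells around it instead of every dot generated so far.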