Week 4- Creating Datasets

Ayham A G -

Hey everyone! Thank you so much for coming back to my blog. This week we will be discussing random datasets and why we should learn to work with generated data before we apply any real datasets.

It’s important that we don’t start off with real data, because real data makes it much more complicated to get started, and the same ideas can be tried out in a simpler form first. One of the easiest ways to do this is to create datasets made of either dots or numbers. We will be discussing dot datasets this week.

Generating dot datasets requires some coding. The simplest one to generate is the truly random dataset. The expression is “(random(), random())”, which picks a random x and a random y for each dot. It creates an assortment of dots with no uniformity or regular pattern, so some areas end up more dense with dots than others. This is good if you want a non-representative dataset to play with. There are other methods that produce more of a pattern but are more complex to create.

This is because these methods are built on something called sampling. Sampling means choosing the points at which we measure something, such as where to shoot rays in a renderer or where to read the color and intensity of an image. Even something as simple as resizing an image requires sampling. Samples should be spread evenly to avoid gaps, but a perfectly regular, repeating pattern leads to aliasing, so we want dots that are evenly spread yet still random. A simple approximation is Mitchell’s best-candidate algorithm. It produces a pleasing random distribution, although it still has some oversampling and undersampling. For each new sample it generates a number of random candidates and keeps the best one, which is the candidate farthest from its closest existing sample. This creates a good level of uniformity because each new dot is pushed away from the dots that are already there.

 

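function sample() {
  // Track the best candidate so far and its distance to the nearest existing sample.
  var bestCandidate, bestDistance = 0;
  for (var i = 0; i < numCandidates; ++i) {
    // Pick a random candidate point inside the width x height area.
    var c = [Math.random() * width, Math.random() * height],
        d = distance(findClosest(samples, c), c);
    // Keep the candidate that is farthest from its closest sample.
    if (d > bestDistance) {
      bestDistance = d;
      bestCandidate = c;
    }
  }
  return bestCandidate;
}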

 

The “numCandidates” variable sets how many candidates are created for each new sample. The larger the number, the more evenly spread the result, but the longer it takes to run. The “findClosest” function returns the existing sample closest to the current candidate, and “distance” measures how far apart the two points are.
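The snippet above relies on a few pieces that are not shown: the “distance” and “findClosest” helpers, the “samples” list, and the values of “width”, “height”, and “numCandidates”. Here is a minimal runnable sketch that fills those in. The canvas size, the candidate count, the brute-force search, and the handling of the very first dot are my own assumptions for illustration, not part of the original code.

```javascript
// Assumed canvas size and candidate count; these values are illustrative only.
const width = 960, height = 500, numCandidates = 10;
const samples = []; // every dot accepted so far

// Euclidean distance between two [x, y] points.
function distance(a, b) {
  return Math.hypot(a[0] - b[0], a[1] - b[1]);
}

// Brute-force nearest-neighbor search; returns null when there are no samples yet.
function findClosest(points, c) {
  let closest = null, closestDistance = Infinity;
  for (const p of points) {
    const d = distance(p, c);
    if (d < closestDistance) {
      closestDistance = d;
      closest = p;
    }
  }
  return closest;
}

// Best-candidate step: keep the candidate farthest from its nearest existing sample.
function sample() {
  let bestCandidate = null, bestDistance = -1;
  for (let i = 0; i < numCandidates; ++i) {
    const c = [Math.random() * width, Math.random() * height];
    const closest = findClosest(samples, c);
    // With no samples yet, any candidate is acceptable.
    const d = closest === null ? Infinity : distance(closest, c);
    if (d > bestDistance) {
      bestDistance = d;
      bestCandidate = c;
    }
  }
  return bestCandidate;
}

// Generate 200 dots, feeding each accepted dot back into the sample list.
for (let i = 0; i < 200; ++i) samples.push(sample());
console.log(samples.length, samples[0]);
```

The brute-force search is slow for large datasets, but it keeps the sketch short and easy to follow.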

function sample() {
  return [Math.random() * width, Math.random() * height];
}

This code is the uniform random sampler. It is the same as the first random expression mentioned above, scaled to the width and height of the area. Because it makes no attempt to keep dots apart, it has severe oversampling and undersampling, leading to large gaps and overlapping dots.
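To put the clumping into numbers, here is a small sketch that generates uniform random dots and measures how far each dot is from its nearest neighbor. The canvas size and dot count are assumptions picked only for illustration. The results change on every run, but the gap between the smallest and largest values is typically large.

```javascript
// Assumed canvas size and dot count, chosen only for illustration.
const width = 960, height = 500, count = 200;

// Uniform random sampler: each dot ignores every dot placed before it.
function sample() {
  return [Math.random() * width, Math.random() * height];
}

const dots = Array.from({ length: count }, sample);

// Distance from each dot to its nearest neighbor (brute force, fine for small counts).
const nearest = dots.map((p, i) => {
  let best = Infinity;
  dots.forEach((q, j) => {
    if (i !== j) best = Math.min(best, Math.hypot(p[0] - q[0], p[1] - q[1]));
  });
  return best;
});

console.log("smallest gap:", Math.min(...nearest).toFixed(1));
console.log("largest gap:", Math.max(...nearest).toFixed(1));
```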

The third algorithm is Bridson’s Poisson disc sampling method, which is one of the most widely used in this area of research. This is because it produces even more uniform data and images than the best-candidate algorithm. Instead of scattering brand-new random candidates across the whole area, it builds outward from samples that already exist, and it enforces the minimum distance r between dots that a Poisson disc distribution requires. It selects an existing active sample and then creates several candidates within the annulus that ranges from r to 2r around it. Any candidate within r of an existing sample is rejected, and the first acceptable candidate is added as a new active sample. If none of the candidates are acceptable, the original sample is marked as inactive.

The first image shows the result of a truly random dataset, while the second image shows the result from the best-candidate algorithm. The second image is much more uniform and represents a more homogeneous group, with some areas of low density. The last image below shows the level of homogeneity that Bridson’s algorithm can produce.
[Image: examples of program results]
[Image: Bridson]

For the code itself, here are the rules that we must follow for it to be successful:

Parameters:

  1. n: the number of dimensions of the area where points are generated (n = 2 for our purposes)
  2. r: the minimum distance between any two points
  3. k: the maximum number of candidates to try around a sample before giving up on it

Grid cell size: r/sqrt(n), so that each cell can hold at most one sample. Each cell stores the index of its sample, and -1 represents no sample.
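As a quick illustration of the grid rule, here is a sketch that sets up the grid for a 2D area. The values of r, width, and height and the “cellIndex” helper are assumptions I made up for the example.

```javascript
// Assumed values for illustration: minimum spacing r and a 960 x 500 domain.
const r = 20, width = 960, height = 500;
const n = 2; // number of dimensions

// A cell of side r / sqrt(n) has a diagonal of exactly r,
// so two samples at least r apart can never share a cell.
const cellSize = r / Math.sqrt(n);
const gridWidth = Math.ceil(width / cellSize);
const gridHeight = Math.ceil(height / cellSize);

// Flat array of sample indices; -1 marks an empty cell.
const grid = new Array(gridWidth * gridHeight).fill(-1);

// Map an [x, y] point to its position in the flat grid array.
function cellIndex(p) {
  const col = Math.floor(p[0] / cellSize);
  const row = Math.floor(p[1] / cellSize);
  return row * gridWidth + col;
}

grid[cellIndex([100, 50])] = 0; // store sample index 0 in its cell
console.log(cellSize.toFixed(2), gridWidth, gridHeight);
```

Storing an index instead of the point itself keeps the grid small and makes the “is this cell empty?” check a single comparison with -1.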

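Bridson Algorithm:

Step 1:

Select the domain and initialize the grid with -1 in each cell.

Step 2:

Generate a random initial point, store its index (0) in its grid cell, and add it to the active list.

Step 3:

While the active list is not empty, choose a random sample from it and generate up to k candidate points in the annulus from r to 2r around that sample.

Step 4:

Check whether each candidate is at least r away from every existing sample, using the grid to look only at nearby cells. Add the first candidate that passes as a new sample and put it on the active list. If none of the k candidates pass, remove the chosen sample from the active list.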
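Putting the steps together, here is a minimal sketch of how they could look in code for the 2D case. The function name “poissonDisc”, the default k = 30, and the example canvas size are my own assumptions. For simplicity, the candidate radius is picked uniformly between r and 2r, which is one of several reasonable choices.

```javascript
// Sketch of Bridson's Poisson disc sampling in 2D.
// width, height, r, and k are caller-supplied; k = 30 is an assumed default.
function poissonDisc(width, height, r, k = 30) {
  const cellSize = r / Math.SQRT2;
  const gridWidth = Math.ceil(width / cellSize);
  const gridHeight = Math.ceil(height / cellSize);
  const grid = new Array(gridWidth * gridHeight).fill(-1); // Step 1
  const samples = [];
  const active = [];

  // Record a sample in the output, the grid, and the active list.
  function addSample(p) {
    const col = Math.floor(p[0] / cellSize);
    const row = Math.floor(p[1] / cellSize);
    grid[row * gridWidth + col] = samples.length;
    samples.push(p);
    active.push(p);
  }

  // True when no existing sample lies within r of p (checks cells up to 2 away).
  function isFarEnough(p) {
    const col = Math.floor(p[0] / cellSize);
    const row = Math.floor(p[1] / cellSize);
    for (let y = Math.max(row - 2, 0); y <= Math.min(row + 2, gridHeight - 1); ++y) {
      for (let x = Math.max(col - 2, 0); x <= Math.min(col + 2, gridWidth - 1); ++x) {
        const index = grid[y * gridWidth + x];
        if (index !== -1) {
          const s = samples[index];
          if (Math.hypot(s[0] - p[0], s[1] - p[1]) < r) return false;
        }
      }
    }
    return true;
  }

  addSample([Math.random() * width, Math.random() * height]); // Step 2

  while (active.length > 0) {
    const i = Math.floor(Math.random() * active.length);
    const parent = active[i];
    let found = false;
    for (let j = 0; j < k; ++j) { // Step 3
      const angle = 2 * Math.PI * Math.random();
      const radius = r * (1 + Math.random()); // between r and 2r
      const p = [parent[0] + radius * Math.cos(angle), parent[1] + radius * Math.sin(angle)];
      const inside = p[0] >= 0 && p[0] < width && p[1] >= 0 && p[1] < height;
      if (inside && isFarEnough(p)) { // Step 4
        addSample(p);
        found = true;
        break;
      }
    }
    if (!found) active.splice(i, 1); // no room left around this sample
  }
  return samples;
}

const dots = poissonDisc(960, 500, 20);
console.log(dots.length);
```

The number of dots changes from run to run, but no two dots should ever be closer than r.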

Thank you so much for reading this week’s blog! I hope to see you next week, when I will discuss numerical datasets using 1s and 0s and how we can make them applicable to actual state-level data and trends!

Comments:

    Juliana Locastro
    Hey Ayham! It's really cool how you got into coding this week. I'm curious to know what coding software you are using to create your random dot data sets? All the coding you did this week sounds super complicated. Could you further explain what each code you mentioned does to your random dot data sets? And could you clarify the purpose of playing around with the random dot data sets and how it relates to gerrymandering and redistricting? I can't wait to read your discussion on numerical datasets next week!
